Semantic Annotation of Nouns

Introduction

In this file we go through the annotations and clouds of 8 nouns: horde, hoop, spot, staal, stof, schaal, blik, spoor. For each of them, the sense distribution, a sort of confusion matrix and a description of the clouds are shown. The descriptions could still go much deeper; for now the priority is an overview of the possibilities and of the variation across lemmas, before going too deep into any one of them. At the same time, they are still only descriptions and no conclusions are drawn from them yet, so beyond this introduction the findings are not summarized.

Sense distribution

For all the cases we have at least two homonyms, of which at least one is polysemous. The sense tags have codes described in a table with definitions at the beginning of each section, but the annotators also had the option of assigning a geen ‘none of the above’ tag, in which case they had to add an explanatory comment.

When setting up the annotation procedure, pilot batches of 40-50 concordance lines of each type were collected to estimate the frequency of the senses we expected (we did have to exclude candidates because some sense was not frequent enough). The annotation of the pilot sets was not extremely thorough and we sometimes modified the definition set afterwards, so it’s best to keep that in mind when reading the comparison between the expected distribution and what came out of the annotations. It’s worth considering that if my estimates over samples of 40-50 tokens match what we find in a bigger sample (and especially if the match is robust across batches), that is quite encouraging: if (the skewness of) the sense distribution turns out to be a factor in the topology of the clouds, it’s useful to know that it can be estimated from such a small sample.
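How much a proportion estimated from a 40-50-line pilot can be trusted can be gauged with a simple confidence interval; a minimal sketch (the function and counts are illustrative, not part of the actual workflow):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion estimated from a small sample."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - margin, centre + margin

# e.g. a sense observed 26 times in a 40-line pilot (cf. horde_1 in Table 1)
low, high = wilson_interval(26, 40)  # roughly (0.50, 0.78)
```

An interval this wide suggests that such small pilots can pin down strong skews (frequent vs. rare senses) but not fine-grained frequency differences, which matches the caution expressed above.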

The Sense distribution subsection of each section then compares the estimated sense distribution based on the pilot concordances with the one found in each batch and in the whole set of tokens. For each type a plot is shown with a row of dots per batch above a line and two more rows below the line representing the pilot-based estimate and the overall distribution. Circles represent tokens tagged with a given sense by the majority of the annotators (each sense has a color) and triangles represent either tokens primarily annotated with the geen ‘none of the above’ tag or tokens for which the annotators did not agree at all. The dots in the batch rows represent one token each, and their transparency codes the mean confidence after standardizing its value by annotator and lemma.
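The standardization mentioned above can be sketched as a z-score within each annotator-and-lemma group; a minimal pandas sketch (column names and ratings are hypothetical, not taken from the actual data):

```python
import pandas as pd

df = pd.DataFrame({
    "annotator":  ["A", "A", "A", "B", "B", "B"],
    "lemma":      ["horde"] * 6,
    "confidence": [5, 4, 3, 2, 2, 5],
})

# z-score each confidence rating within its annotator x lemma group,
# so that annotators with different rating habits become comparable
grp = df.groupby(["annotator", "lemma"])["confidence"]
df["confidence_z"] = (df["confidence"] - grp.transform("mean")) / grp.transform("std")
```

After this transformation, each annotator’s ratings for a given lemma have mean 0 and unit standard deviation, so "high confidence" means high relative to that annotator’s own habits.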

Confusion matrix

For each type, two confusion matrices will be shown in the Confusion matrix subsection. In each of them, a row represents a majority sense (or no_agreement if there was no majority) and a column represents a sense tag. By hovering over the row names it’s possible to retrieve the definition, since the tags are not precisely transparent; columns are also grouped by homonym. Since the geen ‘none of the above’ tags can have different reasons, the annotators’ explanations were classified with the following tags:

  • between, when the annotator reported doubt between two or more of the given senses;
  • not_listed, when the explanation referred to a sense that was not included in the list of senses (or not understood as such);
  • unclear, when the explanation referred to insufficient or unclear context (or simply to difficulty understanding, such as “geen flauw idee” ‘no clue’), and
  • wrong_lemma, when it referred to an issue with lemmatization, part-of-speech tagging (including parts of proper nouns) or even spelling, so that the target didn’t actually correspond to what was meant to be annotated.

The first matrix shows raw annotation counts. Each cell gives the number of tokens with the majority sense of the row that were tagged with the sense of the column: the cell in the row of horde_1 and the column horde_2 says how many tokens with majority sense horde_1 received some horde_2 annotation. The column totals indicate the number of tokens that were tagged with a given sense. The first descriptions only focus on which senses are confused with each other. The caption also records the proportion of tokens with a certain majority sense or homonym that received the same tag from all annotators.

The second (“weighted”) matrix shows the mean of the mean confidences of the annotations. Suppose the row is horde_1 and the column is horde_2: to fill in that cell, for each token with majority sense horde_1 the mean of the confidences of its horde_2 annotations is computed. Since horde_2 is not the majority sense, there won’t be more than one annotation of the same token to average over; for the horde_1 column, each token would have two to four agreeing annotations, and their respective confidences would be averaged to reach one mean confidence per token per sense. The final value of the cell is the mean, across all tokens of that cell, of those mean confidences. Here it is important to take into account that the annotators had to assign senses rather than homonyms: very often, disagreement between sense annotations becomes agreement between homonym choices; I expect the same to happen when confidence ratings are low in polysemous items.
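The construction of both matrices can be sketched from a long table with one row per annotation (column names and toy values are hypothetical; the real pipeline may differ):

```python
import pandas as pd

ann = pd.DataFrame({
    "token":      ["t1", "t1", "t1", "t2", "t2", "t2"],
    "sense":      ["horde_1", "horde_1", "horde_2", "horde_2", "horde_2", "horde_2"],
    "confidence": [5, 4, 5, 3, 4, 4],
})

# majority sense per token (ties would need an explicit no_agreement rule)
majority = ann.groupby("token")["sense"].agg(lambda s: s.mode().iloc[0])
ann["majority"] = ann["token"].map(majority)

# raw matrix: how many tokens with a given majority sense received each tag
raw = (ann.drop_duplicates(["token", "sense"])
          .groupby(["majority", "sense"]).size().unstack(fill_value=0))

# weighted matrix: per token, average the confidences of each sense's
# annotations; then average those per-token means across the cell
per_token = ann.groupby(["majority", "token", "sense"])["confidence"].mean()
weighted = per_token.groupby(["majority", "sense"]).mean().unstack(fill_value=0)
```

With these toy values, token t1 has majority sense horde_1 with a mean confidence of 4.5, and its single disagreeing horde_2 annotation contributes its own confidence unaveraged, as described above.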

Nephology

For each type, the Nephology subsection discusses the role of the parameters in the structure of the cloud of clouds (level 1 of the visualization), showing at least two color-coded plots, and then compares sets of models (level 2). First, parameters that seem to have little to no effect on the variation between models are kept constant to compare the resulting selection; then other combinations that might provide different results are explored; finally, some combination of parameters that seems to provide “satisfying” models is kept constant to look at the actual effect of the less important parameters. Normally, the strongest parameters are those that select first order context features, while the second order parameters rarely make much of a difference.

The comparison between models normally takes the following steps:

  1. examine the range of distances between the models through the distance matrix;
  2. describe the general look of the clouds without color coding and how they change between MDS and t-SNE solutions;
  3. color code with homonym and sense tags and describe the revealed structure.
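Steps 1-3 can be scripted, e.g. with scikit-learn, assuming a precomputed token-by-token distance matrix (here random stand-in data, not an actual cloud):

```python
import numpy as np
from sklearn.manifold import MDS, TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(60, 10))                     # stand-in token vectors
dist = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)

# step 1: the range of pairwise distances between tokens
dist_range = (dist[dist > 0].min(), dist.max())

# step 2: a 2D MDS solution and t-SNE solutions at several perplexities
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords_mds = mds.fit_transform(dist)
tsne_solutions = {
    perp: TSNE(n_components=2, metric="precomputed", init="random",
               perplexity=perp, random_state=0).fit_transform(dist)
    for perp in (5, 20, 30)
}
# step 3 would color-code coords_mds / tsne_solutions by homonym and sense tags
```

Note that with `metric="precomputed"`, scikit-learn’s t-SNE requires a random initialization; the perplexity sweep mirrors the comparison across perplexities described below.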

The description includes the behaviour of outliers in MDS solutions, the separability of homonyms/senses in any kind of solution, how many and how clear-cut clusters show up in the different t-SNE solutions (and, if any provides a particularly good representation, how robust it is across different perplexities) and how such structure relates to the parameters under comparison. For now, individual tokens are only examined on an exceptional basis. A certain bias should be acknowledged: certain settings tend to be preferred (sometimes for theoretical reasons, but not always) and the findings in one type definitely affect how the following ones are understood. Hopefully, time and experience will provide the tools to revise these decisions with better criteria.

At the end of the subsection a next course of action is suggested, such as promising model(s) and which tokens seem interesting to look at. That is highlighted in a nice quote block at the end of each section.

To ease the descriptions, the parameters will be written in all caps and their values will follow them, separated by a colon. These are:

First order part-of-speech (FOC-POS)
Can take the value FOC-POS:nav, when only nouns, adjectives and verbs were selected as first order features, or FOC-POS:all, when there was no such restriction (still, some part-of-speech tags, such as interjections, were always ignored).
The tendency is to default to FOC-POS:nav, since function words are probably less informative (a kind of linguistically informed default).
First order window (FOC-WIN)
Can take the value FOC-WIN:5, when only features within a 5-5 window of the target were included, or FOC-WIN:10, when a 10-10 window was used.
The tendency is to default to FOC-WIN:10, to allow for more information; normally relying on other restrictions is enough to filter out the noise.
Positive pointwise mutual information as filter (PPMI)
Can take the value PPMI:weight, when the second order vectors are weighted by the PPMI value between the first order feature they represent and the target type; PPMI:selection, when only features with a positive PMI with the target type were included but the vectors were not weighted; and PPMI:no, when no such filter was applied. Normally, the models with PPMI:selection are more similar to PPMI:no than to PPMI:weight, and they are not considered in the initial comparisons.
The initial tendency was to default to PPMI:no, since a high PPMI value signals a feature as characteristic of the type rather than of groups of it (like a sense), but in the analyses described in this file it never performs as well as the alternatives.
Vector length (LENGTH)
Can take the values LENGTH:5000 and LENGTH:10000, when the 5000/10000 most frequent features were used as second order dimensions, or LENGTH:FOC, when the same first order dimensions are used for the second order. In that case their number and frequency depend on the result of the first order restrictions for that particular sample of tokens. While this is not an extremely strong parameter, LENGTH:FOC can make a difference against the other two, frequency-based values.
The tendency is to default to LENGTH:FOC because it should be better tailored to the specific context of the tokens in the cloud; it makes clouds with different first order context words harder to compare, but it does seem to perform better in most cases. Between the frequency-based values, I almost never look at LENGTH:10000, since it almost never seems to make much of a difference, but I probably should look into it before discarding it from future clouds. If both frequency-based settings perform very similarly, smaller numbers should be preferred (hence the tendency to choose LENGTH:FOC as well, since it normally means fewer than 5000 dimensions).
Second order part-of-speech (SOC-POS)
Can take the values SOC-POS:nav or SOC-POS:all and refers to a filter on the second order dimensions. This never makes much of a difference.
The tendency is to default to SOC-POS:nav (De Pascale, 2019, pp. 62–63).
Second order window (SOC-WIN)
Can take the values SOC-WIN:4 or SOC-WIN:10 depending on whether the PPMI values for the second order vectors were computed based on a 4-4 or 10-10 window.
This parameter seems to group models for some types, but doesn’t really affect the structure of the clouds that much as far as I can see. The tendency is to default to SOC-WIN:4 (See De Pascale, 2019, pp. 62–63). Could it be that in the cases where it seems relevant, what actually happens is that all the other parameters are just too weak?

Eventually, it would be nice to reinstate sentence boundary as a parameter (replacing, for example, SOC-POS). The difference between LENGTH:5000 and LENGTH:10000 also seems negligible.
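The three PPMI settings can be illustrated with toy co-occurrence counts; a sketch of the general logic, not of the actual implementation:

```python
import numpy as np

# toy co-occurrence counts: rows = target types, columns = first order features
counts = np.array([[10., 0., 2.],
                   [3., 5., 1.]])
total = counts.sum()
p_joint = counts / total
p_target = counts.sum(axis=1, keepdims=True) / total
p_feature = counts.sum(axis=0, keepdims=True) / total

with np.errstate(divide="ignore"):
    pmi = np.log(p_joint / (p_target * p_feature))
ppmi = np.maximum(pmi, 0)               # clip negative PMI to zero

target = 0                              # row index of the target type
feature_vectors = np.eye(3)             # stand-in second order vectors

weighted = ppmi[target][:, None] * feature_vectors   # PPMI:weight
selection = feature_vectors[pmi[target] > 0]         # PPMI:selection
unfiltered = feature_vectors                         # PPMI:no
```

Note how a feature that never co-occurs with the target (the second column here) is zeroed out under PPMI:weight and dropped under PPMI:selection, while under PPMI:no all features are kept.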


horde

The noun horde was tagged with 3 definitions, reproduced in Table 1. The homonyms are roughly equivalent to English ‘horde’ (horde_1) and ‘hurdle’, which can be a literal obstacle, particularly in sports (horde_2), or a figurative one (horde_3). The first homonym is estimated (based on a 40-token sample) to be much more frequent than the second, but clearly distinguishable from it. While the two senses of the second homonym are quite distinct, depending on the clarity of the context there could be some overlap in the annotations of senses within the second homonym.

Table 1. Definitions of ‘horde’.
code definition example freq
horde_1 1 bende, ordeloze groep personen een woeste horde 26
horde_2 2.1 materiële hindernis, m.n. houten raamwerk gebruikt bij het hordelopen de 400m horden bij de vrouwen 5
horde_3 2.2 hindernis in figuurlijke zin een horde nemen 8

Sense distribution

The sample consists of 280 tokens (7 batches) out of 3224 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 1.

As estimated, the first homonym tends to be the most frequent, with at least half the tokens of each batch, and between the senses of the second homonym, the figurative one tends to be more frequent. The only exception is the seventh batch, where a huge majority of the tokens was tagged with the literal “hurdle” sense. I checked the concordance and they are correctly tagged. In any case, the overall distribution is very similar to the estimated one.

“horde” is a noun with two homonyms of very different frequencies, where the least frequent homonym is polysemous with two senses of similar frequency.

Figure 1. Distribution of majority senses of ‘horde’ per batch

Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 2 (raw number of tokens with such senses assigned) and Table 3 (mean confidence of such sense annotation in each token).

We would expect no confusion between horde_1 ‘horde’ on one side and the others (literal and figurative ‘hurdle’) on the other. There are, however, 7 (out of 280) tokens that present an annotation confusing the homonyms. While an inspection of the concordances shows that the majority sense tag is quite straightforward, this discrepancy could be taken into account while looking at the clouds.

There are also 11 cases where some annotator thought the given options didn’t apply (not_listed column); in the 10 ‘horde’ cases, the word is applied to non-human entities, and the annotators resisted tagging them with a sense that specified “people”. Only two tokens presented no agreement between annotators at all (total of the no_agreement row).

Table 2. Non weighted sense matrix of ‘horde’ senses. Proportion of tokens with full agreement per sense-tag is: horde_1: 0.89, horde_2: 0.88, horde_3: 0.86. Proportion of tokens with full agreement per homonym is: horde: 0.89, hurdle: 0.95.
horde
hurdle
geen
senses horde_1 horde_2 horde_3 between not_listed unclear
horde_1 168 3 3 0 10 3
horde_2 1 57 6 0 0 1
horde_3 0 6 51 0 1 0
unclear 0 0 1 0 0 1
no_agreement 0 3 3 1 0 1
total 169 69 64 1 11 6

The weighted matrix shows that there is a relatively high mean confidence in the annotations that became majority senses, and a relatively lower one in the disagreeing annotations. It is remarkable, however, that the one annotation that assigned horde_1 ‘horde’ to a majority horde_2 ‘lit. hurdle’ token had the maximum confidence. The concordance of that token is reproduced in (1).

  (1) ABN Amro signaleert een vangnet aan de onderkant rond de 575 punten en een ’ horde ’ op 610 . Breekt de AEX door het niveau van 610 heen ,
    ABN Amro signals a safety net on the downside around the 575 points and a ‘hurdle’ at 610. If the AEX breaks through the 610 level,
Table 3. Weighted sense matrix of ‘horde’ senses
horde
hurdle
geen
senses horde_1 horde_2 horde_3 between not_listed unclear
horde_1 4.58 3.33 3 0 4 3
horde_2 5 4.69 3.17 0 0 0
horde_3 0 3.5 4.33 0 0 0
unclear 0 0 2 0 0 1.5
no_agreement 0 4 3.5 4 0 4

Nephology of horde

A first impression on the clouds relates to the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of horde created on 10/03/2020, modelling between 262 and 279 tokens. The stress value of the MDS solution for the cloud of models is 0.161.

The parameters with a stronger effect in the distinction between the models are FOC-POS and PPMI (with the biggest difference between PPMI:weight on one side, and PPMI:selection or PPMI:no on the other, see Figure 2), and then FOC-WIN (see Figure 3); second order parameters seem to have a minimal effect. The stress values of the MDS solutions of these models range between 0.105 and 0.254.
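The stress values cited here can be understood via Kruskal’s stress-1 formula; a sketch of the metric variant (the actual MDS implementation, especially if non-metric, applies a monotone transformation to the distances first):

```python
import numpy as np

def kruskal_stress(dist: np.ndarray, coords: np.ndarray) -> float:
    """Kruskal's stress-1: mismatch between the original distances and
    those of a low-dimensional configuration (0 means a perfect fit)."""
    iu = np.triu_indices_from(dist, k=1)
    d_orig = dist[iu]
    d_low = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)[iu]
    return float(np.sqrt(((d_orig - d_low) ** 2).sum() / (d_orig ** 2).sum()))

# a configuration that reproduces its own distances has stress 0
rng = np.random.default_rng(2)
coords = rng.normal(size=(20, 2))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
```

Values around 0.1-0.25, as reported for these clouds, thus indicate a moderate amount of distortion in the two-dimensional picture.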

Figure 2. Cloud of models of ‘horde’ colored by PPMI. Explore it at https://montesmariana.github.io/NephoVis/level1.html?type=horde

Figure 3. Cloud of models of ‘horde’ colored by FOC-WIN. Explore it at https://montesmariana.github.io/NephoVis/level1.html?type=horde

To compare the models, we will first keep SOC-POS:nav + SOC-WIN:4 + LENGTH:FOC constant while varying along the other three variables. We’ll ignore PPMI:selection, since for that variable the greatest difference seems to lie between PPMI:weight and the other two. Looking at the distance matrix between the corresponding clouds (Distance matrix 1), the one with the looser filters seems to be the most different to the rest, with the most similar model being the one that only restricts window size (distance of 0.24), followed by the one that only restricts part-of-speech (distance of 0.39). The three models that restrict window size and something else (either PPMI, FOC-POS or both) seem to be the most similar to each other, with distances between 0.1 and 0.29.

Distance matrix 1. Distance matrix between some models of ‘horde’
mapIndex foc_foc_pos foc_ppmi foc_foc_win
1 nav weight 10_10
2 nav no 10_10
3 all weight 10_10
4 all no 10_10
5 nav weight 5_5
6 nav no 5_5
7 all weight 5_5
8 all no 5_5
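The kind of model-to-model distance discussed here can be sketched, for instance, as one minus the correlation between how two models space the same token pairs (the actual measure used in the workflow may differ):

```python
import numpy as np

def cloud_distance(d1: np.ndarray, d2: np.ndarray) -> float:
    """1 - Pearson correlation between the upper triangles of two
    token-by-token distance matrices over the same tokens."""
    iu = np.triu_indices_from(d1, k=1)
    return 1 - np.corrcoef(d1[iu], d2[iu])[0, 1]

# three toy "clouds": symmetric distance matrices over the same 30 tokens
rng = np.random.default_rng(1)
clouds = [(m + m.T) / 2 for m in (rng.random((30, 30)) for _ in range(3))]

model_dist = np.array([[cloud_distance(a, b) for b in clouds] for a in clouds])
```

The result is a symmetric matrix with zeros on the diagonal, like the distance matrices between models reported throughout.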

Given this selection, we’ll compare the results from MDS and from t-SNE at different perplexity levels. Some t-SNE models at some perplexity levels (mainly from 20 upwards, and with increasing clarity) show pockets of tokens, particularly the models that almost seem to have “pockets” already in MDS.

The MDS solutions with PPMI:no are more susceptible to outliers: the largest part of the cloud is crammed around the center, away from a couple of isolated tokens. This doesn’t seem to bother the t-SNE solutions, although those tokens do tend to occur in the periphery of their subclouds. The homonyms form two distinct hemispheres, so to speak, and the senses of the second homonym are clear enough as well, particularly in the models with FOC-POS and/or PPMI filters. Sometimes, the figurative tokens are spread too widely.

Regarding the t-SNE solutions (see Figure 4 for an example), clusters are hard to identify with perplexity of 5 (unless color-coded), and then they are more clear-cut in the PPMI:weight models: there is a big cloud of horde_1 ‘horde’ and then two smaller clouds of horde_2 and horde_3 respectively (literal and figurative ‘hurdle’). There are still some tokens between the clear clusters, but they don’t seem to persist across models. If we include the PPMI:selection, we can see it works better than PPMI:no (which is quite bad) but not much worse than PPMI:weight.

Figure 4. Tokens of ‘horde’ in the t-SNE solutions (perplexity 30) of the selected models

If we keep the first order parameters constant, the second order parameters seem to have little to no effect; the resulting distance matrices seem to rarely have a value over 0.18. LENGTH:FOC seems to have a bit more of an effect if there are no first order filters, but the difference is more evident in the MDS solutions (the LENGTH:FOC models are less sensitive to the outliers); looking at the distance matrices, it’s really not so strong.

All in all, the FOC-POS:nav models seem better than their FOC-POS:all counterparts; and FOC-WIN:5 looks better than FOC-WIN:10 only for FOC-POS:all in the MDS solution. (By better, I mean more distinct subclouds, less clutter and juxtaposition… I should compare my assessments with actual measures.)

For deeper insight I will look at models with FOC-WIN:10 + FOC-POS:nav + SOC-WIN:4 + SOC-POS:nav, comparing across all PPMI values and between LENGTH:FOC and LENGTH:5000.

Outliers (MDS solutions)

The outliers are two tokens with only one relevant –and rather infrequent– context word and a concordance in French.1 The most problematic are (2) and (3), where the only surviving context word is kad/noun (an abbreviation of kadetten ‘cadets’). The problem is likely that the only surviving context word has a rather low frequency (226) and therefore a high PPMI (5.52), which might lead to a rather sparse vector with LENGTH:5000 or LENGTH:10000. Since the context word is closely linked to the context of the target, the LENGTH:FOC vectors are not so problematic. While these tokens remain outliers in all models, they skew them less when either the PPMI:weight or the FOC-POS:nav filter is applied, and even less with LENGTH:FOC. They also tend to be peripheral in the t-SNE models.

  (2) ) 8.19 . David Palinckx ( ABES ) 7.84 . 60^m horden 0,914 ( kad ) : Wim Marynissen ( AVKA ) 9.80 .
    ) 8.19 . David Palinckx ( ABES ) 7.84 . 60^m hurdles 0,914 ( cadets ) : Wim Marynissen ( AVKA ) 9.80 .
  (3) 400m horden ( sch-2de reeks ) : 1. Shaun Malone ( ACBR ) 59.22 300m horden ( kad ) : 1. Eri Van Vosselen ( SWIN ) 44.05 Ver ( sen
    400m hurdles ( sch-2nd lap ) : 1. Shaun Malone ( ACBR ) 59.22 300m hurdles ( cadets ) : 1. Eri Van Vosselen ( SWIN ) 44.05 Ver ( sen

When the pull of (2) and (3) is cancelled, the other outlier comes out: (4), a fragment from a song in French, Les colonies. Here, if there are no filters, the words et, les and de are counted; the FOC-POS:nav filter excludes de because it’s tagged as a determiner (while the others are tagged as nouns!) and a PPMI filter discards les because of its negative PMI with the target. (4) consistently remains an outlier, and rather peripheral in t-SNE models.

  (4) pour entendre ’ au secours ’ Où sont passés les baobas et les hordes de gosses Dans cette ère de négoce où ne vivent que les big
    to hear ‘help’. Where did the baobas and the hordes of children go? In this era of trading where only the big [ones] live

hoop

The noun hoop was tagged with 3 definitions, reproduced in Table 4. The homonyms are roughly equivalent to ‘lot/pile/bunch’ (hoop_1 in the concrete, specific sense; hoop_2 in the broader sense of ‘a lot of…’) and ‘hope’ (hoop_3). The second homonym is expected to be much more frequent than the first one and very easy to distinguish from it; the first one is not only polysemous but also imbalanced in the frequency of its senses and highly dependent on the specificity of the context for a confident distinction between them.

Table 4. Definitions of ‘hoop’.
code definition example freq
hoop_1 1.1 ongeordende stapel een hoop rommel, gooi maar op de hoop 1
hoop_2 1.2 grote hoeveelheid een hoop mensen, een hele hoop geld 10
hoop_3 2 positieve verwachting, vertrouwen op iets positiefs hoop koesteren, de hoop uitspreken dat… 28

Sense distribution

The sample consists of 320 tokens (8 batches) out of 41946 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 5. As expected, hoop_3 ‘hope’ is overwhelmingly frequent and hoop_2 ‘pile, general’ is more frequent than hoop_1 ‘pile, specific’ (which also seems, at first glance, to be tagged with low confidence). The sense distribution is relatively stable and the most infrequent sense almost always occurs at some point.

“hoop” is a noun with two homonyms of very different frequencies, where the least frequent homonym is polysemous with two senses of different frequencies.

Figure 5. Distribution of majority senses of ‘hoop’ per batch

Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 5 (raw number of tokens with such senses assigned) and Table 6 (mean confidence of such sense annotation in each token). Ideally, there would be no confusion between hoop_1 and hoop_2 on one side and hoop_3 on the other. Indeed, there are only 4 cases, tagged primarily with hoop_3 ‘hope’, where some annotator also assigned a sense tag of the other homonym. While the concordances for those tokens are not the most clear (especially on a tired mind), the majority senses are indeed correct. They could still be interesting to look up in the clouds. For the rest, there are very few that couldn’t be assigned a sense and no cases without any agreement at all.

Table 5. Non weighted sense matrix of ‘hoop’ senses. Proportion of tokens with full agreement per sense-tag is: hoop_3: 0.96, hoop_2: 0.86, hoop_1: 0.65, wrong_lemma: 0.5. Proportion of tokens with full agreement per homonym is: hope: 0.96, pile: 1, geen: 0.5.
pile
hope
geen
senses hoop_1 hoop_2 hoop_3 unclear wrong_lemma
hoop_1 17 6 0 0 0
hoop_2 8 59 0 0 0
hoop_3 1 3 240 3 3
unclear 0 0 1 2 1
wrong_lemma 0 0 1 0 2
no_agreement 0 0 0 0 0
total 26 68 242 5 6

The weighted matrix (Table 6) shows a relatively high confidence in the agreeing annotations compared to the disagreeing ones. One curious case is a token where two annotators assigned a geen tag with minimum confidence and reported being quite lost about the meaning of the expression, while the third one very confidently (maximum confidence) assigned a hoop_3 ‘hope’ tag. The concordance is reproduced in (5).

  (5) het leukste speelgoed . Van het oorspronkelijke dierenbestand van het park blijven op dit ogenblik hoop en al één lama , enkele herten en drie pauwen over . Navraag leerde ons
    the nicest toys. Of the original animal stock of the park there remain at this moment more or less (lit. ‘bunch and all [the rest]’) one llama, a few deer and three peacocks. Enquiries taught us

Here the target is part of a fixed expression, “hoop en al”, derived from hoop_1 ‘bunch’; I should check when inspecting the cloud if there are other instances of it, how they are positioned and how they were tagged.

Table 6. Weighted sense matrix of ‘hoop’ senses
pile
hope
geen
senses hoop_1 hoop_2 hoop_3 unclear wrong_lemma
hoop_1 4.72 4.5 0 0 0
hoop_2 3.62 4.58 0 0 0
hoop_3 3 3.33 4.71 1 2.33
unclear 0 0 5 0.75 1
wrong_lemma 0 0 2 0 4.5
no_agreement 0 0 0 0 0

Nephology of hoop

A first impression on the clouds relates to the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of hoop created on 10/03/2020, modeling between 298 and 317 tokens. The stress value of the MDS solution for the cloud of models is 0.17.

The cloud of models (Figure 6) has two clear groups divided by FOC-POS along the first dimension, and three clear groups within each of them based on the PPMI, with PPMI:selection between PPMI:no and PPMI:weight but closer to the former than to the latter. Within each of those groups, FOC-WIN draws divisions (Figure 7). The stress values of the MDS solutions of these models range between 0.237 and 0.317.

Figure 6. Cloud of models of ‘hoop’ colored by PPMI. Explore it at https://montesmariana.github.io/NephoVis/level1.html?type=hoop

Figure 7. Cloud of models of ‘hoop’ colored by FOC-WIN. Explore it at https://montesmariana.github.io/NephoVis/level1.html?type=hoop

To compare the stronger variables, we’ll first keep the lighter ones (SOC-POS:nav + SOC-WIN:4 + LENGTH:FOC) constant, ignoring also PPMI:selection at the beginning. Without color coding, the clouds look very similar to each other, quite round and with few outliers (more with PPMI:selection than with PPMI:weight). Models with FOC-POS:nav lose some tokens, more with FOC-WIN:5, even more with PPMI:weight and 19 with all those filters applied.

The clouds are strongly dominated by tokens of the second homonym, hoop_3 ‘hope’, which is highly frequent. Tokens of the first homonym do tend to stick together in the MDS models, although not distinctly separated from the other homonym. They also more or less group together in the t-SNE models, in more dispersed clusters in the weighted model (although everything is more dispersed there), and never really separate from the bigger cloud of hoop_3 ‘hope’ tokens.

The model with FOC-WIN:5 + FOC-POS:all + PPMI:selection looks in t-SNE like a pretty MDS model, and much better with LENGTH:5000.

When the weaker variables are fixed, values in the distance matrix range between 0.5 and 0.93 (between the strictest and the least strict PPMI), with three pairs of models at a smaller distance. Other selections of weak variables seem to return similar results. When the stronger variables are fixed, values in the distance matrix range between 0.14 and 0.7, sometimes going higher or lower.

The skewness in frequency seems to make “hoop” hard to model; I would like to look deeper into models with FOC-WIN:5 + FOC-POS:all + PPMI:selection + LENGTH:5000.

Outliers

There is an evident outlier in some MDS clouds, (6). The annotators all agreed on assigning hoop_3 ‘hope’ (which is right, though it occurs inside the proper name Hoop op Zegen ‘Hope for a Blessing’), although with low confidence. It does make me want to review the low confidence cases.

  (6) van de BWB . Wielrennen : Tom Franssens en Dario Di Dio verlaten Hoop op Zegen - Wielerclub Hoop op Zegen Beveren verliest twee van zijn smaakmakers .
    of the BWB. Cycling: Tom Franssens and Dario Di Dio leave Hoop op Zegen. The Hoop op Zegen cycling club of Beveren loses two of its stars.

spot

The noun spot was tagged with 3 definitions, reproduced in Table 7. The homonyms mean roughly ‘ridicule’ (spot_1) and ‘spot(light)’, with a literal (or metaphorical) spotlight for spot_3 and, metonymically, a videoclip for spot_2. The two homonyms have similar frequencies, as do the two senses of the polysemous one, but a relatively high number of challenging tokens is expected.

Table 7. Definitions of ‘spot’.
code definition example freq
spot_1 1 oneerbiedige, ridiculiserende uitspraak of behandeling de spot drijven met, bijtende spot 14
spot_2 2.1 reclameboodschap via radio, televisie, bioscoop een spotje voor tandpasta 9
spot_3 2.2 schijnwerper de spots richten op 7

Sense distribution

The sample consists of 240 tokens (6 batches) out of 3496 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 8.

While some batches have more skewed distributions (the first one has mainly “ridicule” cases and the sixth one, “spotlight” cases), the overall distribution resembles the estimated one, with fewer cases of ambiguous tokens but still rather balanced frequencies.

“spot” is a noun with two homonyms of similar frequency, one of which has two senses of similar frequency.

Figure 8. Distribution of majority senses of ‘spot’ per batch

Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 8 (raw number of tokens with such senses assigned) and Table 9 (mean confidence of such sense annotation in each token).

We would expect no confusion between spot_1 ‘ridicule’ on one side and spot_2 ‘videoclip’ and spot_3 ‘spotlight’ on the other, but also rare confusion between the senses of the second homonym. It’s indeed the case that a very small number of tokens with one of these senses as majority sense received a tag from another sense; most of the confusion comes from tags like unclear and not_listed. In the spot_3 ‘spotlight’ cases with not_listed tags, the annotators suggested that it might be referring to the name of a magazine – the concordances are indeed quite particular and similar to each other, so we could expect them to cluster in the clouds. There is also a strong group of tokens with not_listed as the majority sense: there, the target item is part of the English expression hot spot, and the annotators suggested the meaning of “place” and/or pointed out that it’s from English.

Table 8. Non weighted sense matrix of ‘spot’ senses. Proportion of tokens with full agreement per sense-tag is: spot_1: 0.97, spot_3: 0.77, spot_2: 0.77, not_listed: 0.84. Proportion of tokens with full agreement per homonym is: ridicule: 0.97, film/spotlight: 0.81, geen: 0.68. Column groups: spot_1 = ‘ridicule’; spot_2, spot_3 = ‘film/spotlight’; between, not_listed, unclear, wrong_lemma = geen.

| senses | spot_1 | spot_2 | spot_3 | between | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|
| spot_1 | 105 | 0 | 1 | 1 | 0 | 1 | 0 |
| spot_2 | 3 | 47 | 4 | 0 | 1 | 3 | 0 |
| spot_3 | 0 | 0 | 62 | 0 | 1 | 3 | 10 |
| not_listed | 0 | 0 | 3 | 0 | 19 | 0 | 0 |
| unclear | 1 | 1 | 0 | 0 | 0 | 3 | 1 |
| no_agreement | 1 | 1 | 3 | 0 | 0 | 3 | 3 |
| total | 110 | 49 | 73 | 1 | 21 | 13 | 14 |

The confidence of the assignments in agreement is quite high, with a relatively low confidence for the hot spot group (not_listed row and column). The unexpected part is the mean confidence of 5 for the three cases where the majority sense tag referred to spot_2 ‘videoclip’ but the minority sense to spot_1 ‘ridicule’, a different homonym. They could be considered ambiguous, particularly if primed with other input.

  1. mensen hier namelijk knedliky om hun hersenstam zitten . In vergelijking met de vrolijke spot van Cerný komt Pavel CZácek , de vroegere student journalistiek , dodelijk serieus over .
    people here [have] precisely knedliky in their brainstems. In comparison to the cheerful spot/joke of Cerný looks Pavel CZácek, previously a journalism student, deadly serious.
  2. nieuwe partij Nieuw Rechts , omdat de inhoud te racistisch zou zijn . De spot is suggestief , racistisch en discrimineert , zegt commercieel directeur Theo van der Gun
    new party New Right, because the content was too racist. The spot is suggestive, racist and discriminates, says commercial director Theo van der Gun
  3. was er enkel televisiereclame op de Franstalige zenders , maar plots moesten er ook Nederlandstalige spots ingesproken worden , zegt Ramaekers . Ook hij beaamt dat de sector de afgelopen
    there was only TV advertisement on the French speaking networks, but suddenly Dutch speaking spots also had to be voiced, says Ramaekers. He agrees too that the sector the last
Table 9. Weighted sense matrix of ‘spot’ senses

| senses | spot_1 | spot_2 | spot_3 | between | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|
| spot_1 | 4.74 | 0 | 1 | 0 | 0 | 0 | 0 |
| spot_2 | 5 | 4.35 | 3 | 0 | 1 | 3.67 | 0 |
| spot_3 | 0 | 0 | 4.27 | 0 | 1 | 2.33 | 1.8 |
| not_listed | 0 | 0 | 1 | 0 | 3.4 | 0 | 0 |
| unclear | 4 | 3 | 0 | 0 | 0 | 1.17 | 0 |
| no_agreement | 2 | 3 | 3.83 | 0 | 0 | 0 | 3 |

Nephology of spot

A first impression of the clouds concerns the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of spot created on 10/03/2020, modeling between 220 and 235 tokens. The stress value of the MDS solution for the cloud of models is 0.106.
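A minimal sketch of how such a stress value can be obtained, assuming the reported figure is Kruskal’s Stress-1 computed over a precomputed distance matrix (the vectors below are random stand-ins for the models, not the actual data):

```python
# Sketch: MDS on a precomputed model-distance matrix, with Kruskal's Stress-1
# (assumed to be the reported stress measure) computed from the result.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
vectors = rng.random((10, 5))          # random stand-ins for the model vectors
dist = squareform(pdist(vectors))      # precomputed distance matrix

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)

# Stress-1 = sqrt( sum (d_ij - dhat_ij)^2 / sum d_ij^2 )
fitted = squareform(pdist(coords))
stress1 = np.sqrt(((dist - fitted) ** 2).sum() / (dist ** 2).sum())
```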

The grouping of the cloud of models seems to come firstly from an interaction between FOC-POS and PPMI, so that the right half belongs to those with FOC-POS:all + PPMI:selection | PPMI:no, the bottom left quarter is populated by PPMI:weight models and the top left quarter by FOC-POS:nav + PPMI:selection | PPMI:no (Figure 9). Each smaller group defined by a combination of those two parameters is further split by FOC-WIN (Figure 10). Other parameters seem to have a very weak effect, although LENGTH:FOC seems to form a group within the FOC-POS:nav area. The stress values of the MDS solutions of these models range between 0.165 and 0.264.

Figure 9. Cloud of models of 'spot' colored by `PPMI`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=spot'> here</a>.


Figure 10. Cloud of models of 'spot' colored by `FOC-WIN`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=spot'> here</a>.


To compare the effect of the stronger parameters, we will first set the weaker ones to SOC-WIN:4 + SOC-POS:nav + LENGTH:FOC, and initially discard PPMI:selection.
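This selection procedure amounts to filtering the full grid of parameter settings. A sketch, with hypothetical column names but the parameter values used in this report:

```python
# Sketch: the 144-model parameter grid and the selection described above.
# Column names are hypothetical; values follow the report's labels.
import itertools
import pandas as pd

grid = pd.DataFrame(
    list(itertools.product(
        ["all", "nav"],                  # FOC-POS
        ["weight", "selection", "no"],   # PPMI
        ["5_5", "10_10"],                # FOC-WIN
        ["4_4", "10_10"],                # SOC-WIN
        ["all", "nav"],                  # SOC-POS
        ["FOC", "5000", "10000"],        # LENGTH
    )),
    columns=["foc_pos", "ppmi", "foc_win", "soc_win", "soc_pos", "length"],
)

# Fix the weaker parameters and discard PPMI:selection.
selection = grid[
    (grid["soc_win"] == "4_4") & (grid["soc_pos"] == "nav")
    & (grid["length"] == "FOC") & (grid["ppmi"] != "selection")
]
```

The remaining rows vary only in FOC-POS, PPMI and FOC-WIN, which gives the 8 models compared below.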

The distance matrix between the 8 remaining models (Distance matrix 2) suggests a main grouping between PPMI:weight and PPMI:no models, in the sense that the models of the former tend to be more similar to each other (with distances ranging between 0.21 and 0.41) than those of the latter (0.40-0.73). The model most different from the rest is the one without any restrictions, and it is represented by an MDS solution with a dense center and some outliers. Models with FOC-POS:nav leave out a large number of tokens, and a few more are left out when PPMI:weight is added. These tokens turn out to be mostly the “magazine” group (tokens with spot_3 ‘spotlight’ as majority sense and not_listed as minority sense, with comments along the lines of “probably the name of a magazine”); an example is reproduced in (10). The models that don’t remove them group them tightly in all solutions, both MDS and t-SNE with any perplexity.

Distance matrix 2. Distance matrix between some models of ‘spot’

| mapIndex | foc_foc_pos | foc_ppmi | foc_foc_win |
|---|---|---|---|
| 1 | nav | weight | 10_10 |
| 2 | nav | no | 10_10 |
| 3 | all | weight | 10_10 |
| 4 | all | no | 10_10 |
| 5 | nav | weight | 5_5 |
| 6 | nav | no | 5_5 |
| 7 | all | weight | 5_5 |
| 8 | all | no | 5_5 |
  1. Spots op tweede Walem moest tegen Broechem in laatste instantie nog James Van Vaerenbergh
    Spots on the second Walem had to face Broechem as a last resort yet James Van Vaerenbergh
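The model-to-model distances in matrices like Distance matrix 2 can be sketched as follows. Note the assumptions: the token clouds are random stand-ins, and comparing models via the Euclidean distance between their token-by-token cosine-distance profiles is an illustrative choice, not the procedure documented in this report.

```python
# Sketch (assumed, illustrative procedure): a distance matrix between models,
# each model represented by its token-by-token cosine-distance profile.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
n_tokens, n_models = 30, 8

# Random stand-ins for each model's token vectors.
token_clouds = [rng.random((n_tokens, 10)) for _ in range(n_models)]

# One condensed token-by-token cosine-distance profile per model.
profiles = np.array([pdist(cloud, metric="cosine") for cloud in token_clouds])

# Model-by-model matrix: Euclidean distance between the profiles.
model_dist = squareform(pdist(profiles, metric="euclidean"))
```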

An inspection without color coding shows that non weighted MDS solutions are more sensitive to certain outliers; the t-SNE models tend to show two pockets and a scattered mass, the pockets being small with perplexity 5, one of them growing for perplexity 20, and the scattered mass becoming more scattered with increasing perplexity. With perplexity 50, the pockets are only visible in PPMI:weight models and, maybe, in FOC-POS:nav + FOC-WIN:5 models.
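The t-SNE comparison across perplexities can be sketched as follows, with random stand-in vectors for the tokens; the perplexity values are the ones discussed in the text:

```python
# Sketch: token-cloud solutions at several perplexity values, using random
# stand-in token vectors (not the actual token-level distance data).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
tokens = rng.random((60, 20))          # stand-in for the token vectors

solutions = {}
for perplexity in (5, 20, 30, 50):     # perplexity must stay below n_samples
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=0)
    solutions[perplexity] = tsne.fit_transform(tokens)
```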

Color coding lets us identify the clear group of hot spot tokens, which cluster even in the MDS solutions. The rest of the cloud is clearly split between the homonyms and less clearly between senses, with denser clouds surrounded by some outliers in the PPMI:no models. The t-SNE models show a very good split between homonyms even with low perplexity, nicer for PPMI:weight models, while the PPMI:no ones become too dispersed too soon. Perplexity of 50 seems too high: the tokens are too scattered. The big pocket belongs to spot_1 ‘ridicule’, while the biggest cloud joins the senses of the second homonym. Strangely enough, some tokens of spot_1 are included in the big cloud (the same ones in different solutions), so they would be worth a deeper investigation.

Fixing the stronger parameters to look at the weaker ones (FOC-WIN:10 + PPMI:weight + LENGTH:FOC) gives us very similar clouds. The largest distance between the remaining models is 0.46, and except for the FOC-POS:nav + SOC-WIN:4 pair, with a value of 0.34, the distance between models that only diverge in SOC-POS is 0.0. With PPMI:selection, that distance is even smaller, but the distance between FOC-POS:nav models and FOC-POS:all + SOC-WIN:4 reaches 0.68. If we also set LENGTH:5000, those distances rise to 0.88-0.9.

“spot” seems to offer a good example of the granularity of homonymy versus polysemy, but some cases require further observation. I would choose models with PPMI:weight + FOC-WIN:10 + SOC-POS:nav + SOC-WIN:4 + LENGTH:FOC | LENGTH:5000. Selecting the FOC-POS would depend on whether we want to model the single “Spot op” tokens or not. This is likely to turn out different without sentence boundaries.


staal

The noun staal was tagged with 4 definitions, reproduced in Table 10. The homonyms correspond roughly to “steel” (referring to the material in staal_1 and to an object made of it in staal_2) and “sample” (in a general sense in staal_3 and with the specific connotation of “evidence” in staal_4). Both homonyms are then polysemous, although with very skewed distributions, and staal_4 was not recorded in the original estimation.

Table 10. Definitions of ‘staal’.

| code | definition | example | freq |
|---|---|---|---|
| staal_1 | 1.1 zeer hard ijzer met laag koolstofgehalte | twaalf ton staal, ijzer en staal, een man van staal | 21 |
| staal_2 | 1.2 voorwerp of deel van een voorwerp uit zulk metaal | het staal van de velgen is verroest | 3 |
| staal_3 | 2.1 monster van een stof of materiaal, bij wijze van proef | een staal vragen, goederen op staal verkopen | 10 |
| staal_4 | 2.2 proef, voorbeeld, bewijs | een staaltje van hun kunnen, een staaltje van bekwaamheid | 0 |

Sense distribution

The sample consists of 320 tokens (8 batches) out of 5796 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 11. The overall distribution resembles the estimation, with a great majority of staal_1 ‘steel-material’ cases, then staal_3 ‘sample-general’ and staal_2 ‘steel-object’. The number of cases without an assigned sense is much smaller, and staal_4 ‘sample-evidence’ is shown to occur. The distribution does seem to vary a lot between batches. We can say then that:

“staal” is a noun with two homonyms of very different frequencies, both with different senses of very different frequencies.

Figure 11. Distribution of majority senses of 'staal' per batch


Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 11 (raw number of tokens with such senses assigned) and Table 12 (mean confidence of such sense annotation in each token).

We expect some confusion between the senses of each homonym, with little or no confusion between homonymous items. That seems indeed to be the case: there is barely any confusion between homonyms, and some between senses of the same homonym, particularly cases where the majority assigned a more general sense (staal_1 ‘steel-material’ or staal_3 ‘sample-general’) and a minority the more specific one (staal_2 ‘steel-object’ or staal_4 ‘sample-evidence’ respectively), which is to be expected if the context is not very precise. In fact, none of the cases of staal_4 ‘sample-evidence’ shows full agreement, and only 0.38 of the staal_2 ‘steel-object’ cases do. There are very few cases that couldn’t be assigned any of the given tags, and only one where the annotators couldn’t agree.

Table 11. Non weighted sense matrix of ‘staal’ senses. Proportion of tokens with full agreement per sense-tag is: staal_2: 0.38, staal_1: 0.69, staal_3: 0.59, wrong_lemma: 1. Proportion of tokens with full agreement per homonym is: steel: 0.96, sample: 0.93, geen: 0.33. Column groups: staal_1, staal_2 = ‘steel’; staal_3, staal_4 = ‘sample’; not_listed, unclear, wrong_lemma = geen.

| senses | staal_1 | staal_2 | staal_3 | staal_4 | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|
| staal_1 | 229 | 62 | 1 | 3 | 4 | 1 | 0 |
| staal_2 | 7 | 13 | 0 | 0 | 0 | 1 | 0 |
| staal_3 | 4 | 0 | 66 | 22 | 0 | 1 | 0 |
| staal_4 | 0 | 0 | 8 | 8 | 0 | 0 | 0 |
| not_listed | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| unclear | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| wrong_lemma | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| no_agreement | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| total | 241 | 75 | 78 | 33 | 5 | 5 | 1 |

In broad terms, mean confidence seems high in cases of agreement. The cases with staal_2 ‘steel-object’ as majority sense seem to get a lower confidence, although those with staal_1 ‘steel-material’ as majority and staal_2 ‘steel-object’ as minority have a higher mean confidence. On the other hand, both in agreement and disagreement with each other, staal_3 ‘sample-general’ and staal_4 ‘sample-evidence’ annotations received a relatively high confidence.

Table 12. Weighted sense matrix of ‘staal’ senses

| senses | staal_1 | staal_2 | staal_3 | staal_4 | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|
| staal_1 | 4.02 | 4.05 | 3 | 2.33 | 3.5 | 1 | 0 |
| staal_2 | 3.29 | 3.28 | 0 | 0 | 0 | 1 | 0 |
| staal_3 | 3.5 | 0 | 4.26 | 4.18 | 0 | 3 | 0 |
| staal_4 | 0 | 0 | 4.25 | 4.06 | 0 | 0 | 0 |
| not_listed | 0 | 0 | 3 | 0 | 2 | 0 | 0 |
| unclear | 0 | 0 | 5 | 0 | 0 | 3 | 0 |
| wrong_lemma | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| no_agreement | 0 | 0 | 3 | 0 | 0 | 0 | 0 |

Nephology of staal

A first impression of the clouds concerns the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of staal created on 10/03/2020, modeling between 313 and 319 tokens. The stress value of the MDS solution for the cloud of models is 0.152.

The most dividing parameter is FOC-WIN, splitting the cloud of models along the second dimension. Within each half, the clearest divisions are made by PPMI (Figure 12), FOC-POS and, surprisingly enough, SOC-WIN, although that distance is minimal when FOC-POS:nav (Figure 13). The stress values of the MDS solutions of these models range between 0.199 and 0.253.

Figure 12. Cloud of models of 'staal' colored by `PPMI`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=staal'> here</a>.


Figure 13. Cloud of models of 'staal' colored by `FOC-POS`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=staal'> here</a>.


To compare the effect of the weaker variables, we selected models with SOC-POS:all + LENGTH:FOC + PPMI:weight | PPMI:no. We compared the outcomes of different SOC-WIN values, but they didn’t seem that relevant. The distances between the models in each SOC-WIN group range between 0.16 and 0.46/0.51, and distances between models that only vary in that parameter rarely go above 0.2.

Few tokens are discarded by the restrictive models for this type. Without color coding, we can see that non-weighted models tend to have a dense core with some outliers in the MDS solutions, while weighted models are more evenly dispersed; in the t-SNE models, some reasonable clusters begin to show with a perplexity of 20, taking a nicer shape with a perplexity of 30 for the weighted models and becoming somewhat less interesting with a perplexity of 50.

Color coding shows that homonyms are well distinguished; the subclouds are even almost separated in MDS solutions of restrictive models. The “sample” homonym sticks together in all t-SNE solutions, while the “steel” tokens are widely spread, forming two or three clusters that become clear with a perplexity of 20 or 30 but not so much with 2 or 50. Senses are not clearly distinguished in the MDS solutions, but staal_4 ‘sample-evidence’ seems to cluster neatly in all t-SNE solutions (especially PPMI:weight models with a perplexity of 20 or higher), in a small group with other tokens of the same homonym. Some of the tokens of this sense remain scattered around the cloud of the other homonym, so they require further inspection.

“staal” seems to model the distinction between homonyms neatly, but has a hard time identifying the annotated senses. The models do offer other clusters that I would like to inspect and define. Because of the shape of the subclouds and the variety between the options, I would like to further inspect cases of FOC-WIN:10 + FOC-POS:nav + PPMI:weight + SOC-POS:nav to look into the difference between LENGTH and SOC-WIN values.


stof

The noun stof was tagged with 5 definitions, reproduced in Table 13. Both its homonyms are polysemous, although the first (and most frequent) one presents quite distinct senses, while the distinction in the second one is more subtle. The first homonym includes the senses of “substance” (stof_1), “fabric” (stof_2) and “topic” (stof_3), while the second one corresponds to “dust”, either in the air that you can breathe (stof_4) or as a powder-like state of a substance (stof_5). The first homonym is expected to be twice as frequent as the second one, with its last sense relatively infrequent. From the pilot annotation, the tags are expected to exhaust the concordances.

Table 13. Definitions of ‘stof’.

| code | definition | example | freq |
|---|---|---|---|
| stof_1 | 1.1 materie, substantie van een bepaald type | giftige stoffen, vaste stof, grijze stof | 15 |
| stof_2 | 1.2 weefsel | wollen en katoenen stoffen | 11 |
| stof_3 | 1.3 onderwerp waarover men spreekt, schrijft, nadenkt etc. | stof voor een roman, stof tot onenigheid | 4 |
| stof_4 | 2.1 massa zeer kleine droge deeltjes van verschillende oorsprong, door de lucht meegevoerd | een wolk stof, stof afnemen | 2 |
| stof_5 | 2.2 massa zeer kleine deeltjes als toestand van een specifieke substantie | iets tot stof vermalen, tot stof verpulveren | 8 |

Sense distribution

The sample consists of 320 tokens (8 batches) out of 24502 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 14.

The overall distribution is quite similar to the expected one, except that the stof_4 ‘dust’ sense was tagged much more frequently than stof_5 ‘powder’, contrary to what I found in my pilot sample. That said, it is also very clear that the mean confidence of those tokens is quite low. There is also a small number of cases that seemed particularly challenging to annotate.

“stof” is a noun with two homonyms of different frequencies, both polysemous, the most frequent having one frequent sense and two less frequent ones, and the infrequent one having two senses with a skewed distribution.

Figure 14. Distribution of majority senses of 'stof' per batch


Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 14 (raw number of tokens with such senses assigned) and Table 15 (mean confidence of such sense annotation in each token).

We would expect quite some confusion between the two senses of the “dust” homonym (stof_4 and stof_5), more than between the senses of the first homonym, but some confusion between stof_1 ‘substance’ and the “dust” senses could also be acceptable. Still, the annotators were made aware of the distinction between the homonyms (through the numbers of the definitions… assuming they did pay attention to them), so they might be more careful about the difference between them.

In general terms, there is indeed little overlap between the senses. There are more annotations of the “dust” senses in cases with stof_1 ‘substance’ as majority sense than of other senses. However, there is also an unexpectedly high number of cases where the majority assigned stof_3 ‘topic’ and a minority assigned one of the “dust” senses. These are mostly instances of figurative expressions such as het stof doen opwaaien ‘kick up the dust (create chaos)’, where there is a predominant “topic” theme. There is also a relatively high number of cases without any agreement between annotators, some of which are also idiomatic expressions (door het stof gaan ‘bite the dust’ or het stof doen opwaaien ‘kick up the dust’).

Table 14. Non weighted sense matrix of ‘stof’ senses. Proportion of tokens with full agreement per sense-tag is: stof_1: 0.9, stof_2: 0.85, stof_3: 0.69, stof_4: 0.54. Proportion of tokens with full agreement per homonym is: substance: 0.87, dust: 0.69. Column groups: stof_1, stof_2, stof_3 = ‘substance’; stof_4, stof_5 = ‘dust’; not_listed, unclear, wrong_lemma = geen.

| senses | stof_1 | stof_2 | stof_3 | stof_4 | stof_5 | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|---|
| stof_1 | 141 | 2 | 0 | 6 | 6 | 1 | 0 | 0 |
| stof_2 | 4 | 53 | 3 | 2 | 0 | 0 | 0 | 0 |
| stof_3 | 1 | 2 | 54 | 9 | 2 | 2 | 2 | 1 |
| stof_4 | 1 | 2 | 9 | 52 | 11 | 1 | 2 | 0 |
| stof_5 | 2 | 1 | 0 | 3 | 5 | 0 | 0 | 0 |
| unclear | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 |
| no_agreement | 3 | 1 | 9 | 11 | 3 | 5 | 3 | 0 |
| total | 152 | 61 | 75 | 85 | 27 | 9 | 9 | 1 |

While for the senses of the first homonym the confidence levels seem to be high (even when assigning to a case where the majority sense is from the other homonym), those of the second one tend to be rather low. The doubtfulness comes probably from trying to choose between the subtly different senses of “dust”, rather than from distinguishing between homonyms.

Table 15. Weighted sense matrix of ‘stof’ senses

| senses | stof_1 | stof_2 | stof_3 | stof_4 | stof_5 | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|---|
| stof_1 | 4.25 | 4.5 | 0 | 2.5 | 3.17 | 3 | 0 | 0 |
| stof_2 | 3.5 | 4.48 | 5 | 2.5 | 0 | 0 | 0 | 0 |
| stof_3 | 4 | 2 | 4.47 | 4 | 4 | 3.5 | 0 | 0 |
| stof_4 | 4 | 4.5 | 3.11 | 3.89 | 3.09 | 0 | 0 | 0 |
| stof_5 | 3 | 5 | 0 | 2.67 | 3.2 | 0 | 0 | 0 |
| unclear | 0 | 0 | 0 | 3.5 | 0 | 0 | 1.75 | 0 |
| no_agreement | 3.67 | 4 | 3.61 | 3.41 | 3.67 | 3.6 | 2.33 | 0 |

Nephology of stof

A first impression of the clouds concerns the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of stof created on 10/03/2020, modeling between 314 and 320 tokens. The stress value of the MDS solution for the cloud of models is 0.158.

The strongest division between models is given by the FOC-WIN parameter along the vertical dimension, and then by FOC-POS (Figure 15) and PPMI (particularly PPMI:weight vectors against PPMI:selection and PPMI:no, Figure 16). The stress values of the MDS solutions of these models range between 0.203 and 0.263.

Figure 15. Cloud of models of 'stof' colored by weighting. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=stof'> here</a>.


Figure 16. Cloud of models of 'stof' colored by first order window. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=stof'> here</a>.


To compare the effect of the stronger parameters I’ll first fix the weaker ones to SOC-POS:nav + SOC-WIN:4 + LENGTH:FOC, initially discarding PPMI:selection (when I later compared it, it did give results somewhere between the other two). In the resulting selection of models we can see that very few tokens are lost to the restrictions (only 1 with FOC-POS:nav + FOC-WIN:5; 3 with FOC-POS:nav + PPMI:weight and 6 with all three restrictions). The distances in the distance matrix (Distance matrix 3) range between 0.2 (between models with FOC-POS:all + PPMI:weight, followed by 0.23 for models with FOC-POS:all + PPMI:no) and 0.77 (between the strictest and the weakest restrictions). The strictest model is the one most different from all the rest.

Distance matrix 3. Distance matrix between some models of ‘stof’

| mapIndex | foc_foc_pos | foc_ppmi | foc_foc_win |
|---|---|---|---|
| 1 | nav | weight | 10_10 |
| 2 | nav | no | 10_10 |
| 3 | all | weight | 10_10 |
| 4 | all | no | 10_10 |
| 5 | nav | weight | 5_5 |
| 6 | nav | no | 5_5 |
| 7 | all | weight | 5_5 |
| 8 | all | no | 5_5 |

Without color coding, we can see that MDS solutions with PPMI:no tend to have a dense core with some outliers (rather like satellites) and that neat clusters start to form with perplexity of 20 for the t-SNE solutions. There is one particular small ball in the periphery, especially in the PPMI:weight models, and two big subclouds, especially in the FOC-POS:nav models. This structure becomes much clearer with perplexity 30, but raising it to 50 merges those big subclouds and hides the small ball in the PPMI:no models.

Color coding lets us see that homonyms seem to group together in MDS solutions, but rather spread out and with a lot of overlap; senses also seem to group together but with big overlap, and stof_5 ‘powder’ is quite dispersed in PPMI:no models. In t-SNE models, there is no clear division of homonyms for a perplexity of 5, but it does become clear when it’s 20. The little ball, persistent across all models, has tokens of different senses, but that is an artifact of the annotators’ confusion, since those tokens clearly correspond to the idiom “de stof doen opwaaien”. From perplexity 30 on, one of the big clouds belongs to stof_1 ‘substance’, while the other one is split between stof_2 ‘fabric’ and stof_4 ‘dust’, and there’s a smaller, less compact one for stof_3 ‘topic’. Perplexity 50 does not improve the structure; only for FOC-POS:nav models does it stay relatively clear.

To look at the variation between the weaker parameters, I fixed the strongest ones to a combination that already seems to work out fine: FOC-POS:nav + PPMI:weight + FOC-WIN:10 (a stricter FOC-WIN doesn’t seem to improve the results). I also disregarded LENGTH:10000. The distance matrix between the models in the resulting selection has values ranging from 0.07 (between the SOC-POS:nav + LENGTH:5000 models) to 0.49 (between the SOC-POS:nav + SOC-WIN:4 + LENGTH:FOC and the SOC-POS:all + SOC-WIN:4 + LENGTH:5000 models). The model with SOC-POS:nav + SOC-WIN:4 + LENGTH:FOC is the most different from the rest and the one I like the most. The models with LENGTH:FOC keep the stof_3 ‘topic’ cluster well away from the rest, while LENGTH:5000 leaves it more dispersed and closer to the bigger masses. I checked the previous combination of fixed weak parameters, replacing LENGTH:FOC with LENGTH:5000, and the results were indeed less structured than the ones I described, to the point where PPMI:no t-SNE models quickly merged into one cluster.

The “stof” models don’t seem too bad at identifying homonyms but they do combine two senses of different homonyms into a cluster, so I would like to look into that. They also successfully identify an idiomatic expression. For further inspection I would select models with FOC-POS:nav + PPMI:weight + LENGTH:FOC + SOC-POS:nav + SOC-WIN:4.


schaal

The noun schaal was tagged with 6 definitions, reproduced in Table 16. Both homonyms are polysemous but have very different frequencies. The first one, roughly equivalent to (abstract) “scale”, is estimated to present mostly the “on a big scale” sense (schaal_3) and fewer cases of specific scales, either specifying the relation between sizes (schaal_2) or with a name or range (schaal_1). The second, infrequent homonym was mostly registered in the sense of “dish” (schaal_5) but could also occur with the “shell” (schaal_4) or the “scale-dish” (schaal_6) senses.

Table 16. Definitions of ‘schaal’.

| code | definition | example | freq |
|---|---|---|---|
| schaal_1 | 1.1 een geordende reeks cijfers, afstanden, hoeveelheden e.d. waarmee iets gemeten wordt | de schaal van Celsius, Richter, op een schaal van 1 tot 5 | 0 |
| schaal_2 | 1.2 de verhouding tussen de grootte van iets en de weergave ervan in een kaart, model, grafiek etc. | een schaal van 1:20, een schaal van 10 km | 6 |
| schaal_3 | 1.3 grootteorde, omvang | de schaal van een probleem, op grote/kleine schaal | 24 |
| schaal_4 | 2.1 harde buitenbekleding van zekere organische zaken | de schaal van een ei, de schalen van een mossel | 0 |
| schaal_5 | 2.2 ondiepe en wijde schotel | een schaal met vruchten | 4 |
| schaal_6 | 2.3 elk van de beide schotels die aan de armen van een balans hangen | gewicht in de schaal leggen | 0 |

Sense distribution

The sample consists of 320 tokens (8 batches) out of 14249 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 17.

The overall distribution indeed resembles the estimate, adding some presence of the senses schaal_1 ‘scale-range’ (not distinguished from schaal_2 ‘scale-transformation’ in the pilot sample) and schaal_6 ‘dish-scale’. schaal_4 ‘shell’ would seem to occur only once in the whole concordance.

“schaal” is a noun with two homonyms of different frequencies, both polysemous with senses of different frequencies.

Figure 17. Distribution of majority senses of 'schaal' per batch


Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 17 (raw number of tokens with such senses assigned) and Table 18 (mean confidence of such sense annotation in each token).

We would expect no confusion between tokens of different homonyms, but disagreement between the senses within each homonym would be acceptable. Confusion in the “dish” set of senses could be attributed to unclear or unspecified contexts, while that between “scale” senses could also be due to a lack of understanding of the differences between them.

Table 17. Non weighted sense matrix of ‘schaal’ senses. Proportion of tokens with full agreement per sense-tag is: schaal_3: 0.93, schaal_1: 0.93, schaal_5: 0.95, schaal_2: 0.75, schaal_6: 0.79, unclear: 0.33, schaal_4: 1. Proportion of tokens with full agreement per homonym is: scale: 0.98, dish: 0.9, geen: 0.2. Column groups: schaal_1, schaal_2, schaal_3 = ‘scale’; schaal_4, schaal_5, schaal_6 = ‘dish’; between, not_listed, unclear = geen.

| senses | schaal_1 | schaal_2 | schaal_3 | schaal_4 | schaal_5 | schaal_6 | between | not_listed | unclear |
|---|---|---|---|---|---|---|---|---|---|
| schaal_1 | 30 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| schaal_2 | 1 | 12 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| schaal_3 | 8 | 3 | 208 | 0 | 0 | 1 | 2 | 1 | 0 |
| schaal_4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| schaal_5 | 0 | 0 | 1 | 0 | 39 | 0 | 1 | 0 | 0 |
| schaal_6 | 0 | 0 | 1 | 0 | 1 | 19 | 0 | 2 | 0 |
| unclear | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 3 |
| no_agreement | 5 | 2 | 5 | 0 | 2 | 3 | 0 | 2 | 3 |
| total | 44 | 17 | 220 | 1 | 43 | 23 | 3 | 5 | 6 |

We find indeed very little overlap, either between senses of the same homonym or between schaal_3 ‘scale-main’ and the “dish” senses. The schaal_3 ‘scale-main’ tokens that were assigned senses from another homonym are reproduced in (11) through (13): (11) received the schaal_6 ‘dish-scale’ tag, and (12) and (13) received the between tag with the explanation that both schaal_1 ‘scale-range’ and schaal_3 ‘scale-main’ could be appropriate. (14) shows the token with schaal_5 ‘dish’ as majority sense and schaal_3 ‘scale-main’ as minority sense. The other overlapping instances are not so interesting.

  1. crimineel wapengebruik . Juist daarom vreest Geoffrey Bindman , een vooraanstaand mensenrechtenactivist , een glijdende schaal : De bewapening van agenten om criminaliteit de kop in te drukken is iets
    criminal use of weapons. Precisely because of that Geoffrey Bindman, a prominent human rights activist, feared a slippery slope [argument] (lit. a slippery dish): Arming agents to stamp out crime is something
  2. het feest van de hoogst individuele expressie en de behoefte aan authenticiteit . En de schaal van de menselijke maat wordt opnieuw uitgevonden . Waar anders dan in het woninginterieur wordt
    the feast/celebration of the highest individual expression and the need for authenticity. And the scale of the human measure is being reinvented. Where else but in the interior of the home is
  3. verscheen tien jaar geleden de westerse wasmachine . Kijk , dat is globalisering op menselijke schaal . Een ouderwetse bak nog , die je met water vult , waarin je zeep
    ten years ago the western washing machine appeared. Look, that is globalization on a human scale. Still an old fashioned tank that you fill with water, where you [add] soap
  4. Feyenoord dit seizoen niets . Geen beker , geen titel . PSV krijgt straks de schaal die hoort bij de kampioen van Nederland . De verklaring voor het Brabantse succes is
    Feyenoord [got] nothing this season. No cup, no title. PSV will soon receive the dish that belongs to the champion of the Netherlands. The explanation for the Brabant success is

Overall, the agreeing annotations have rather high confidence. There are still some disagreeing ones with high confidence, which could be due to an individual annotator’s tendency, given that it was only one annotator in each case. The single “shell” annotation, which seemed unproblematic (the only filled cell in the corresponding row and column), somehow didn’t receive the highest confidence from all annotators.

Table 18. Weighted sense matrix of ‘schaal’ senses

| senses | schaal_1 | schaal_2 | schaal_3 | schaal_4 | schaal_5 | schaal_6 | between | not_listed | unclear |
|---|---|---|---|---|---|---|---|---|---|
| schaal_1 | 4.59 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| schaal_2 | 5 | 4.42 | 2.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| schaal_3 | 3.75 | 3.67 | 4.56 | 0 | 0 | 3 | 2.5 | 2 | 0 |
| schaal_4 | 0 | 0 | 0 | 4.75 | 0 | 0 | 0 | 0 | 0 |
| schaal_5 | 0 | 0 | 1 | 0 | 4.51 | 0 | 2 | 0 | 0 |
| schaal_6 | 0 | 0 | 5 | 0 | 5 | 4.51 | 0 | 4 | 0 |
| unclear | 0 | 0 | 4 | 0 | 5 | 0 | 0 | 0 | 1.39 |
| no_agreement | 2.8 | 3.5 | 3.8 | 0 | 4 | 2.33 | 0 | 1.5 | 3.33 |

Nephology of schaal

A first impression of the clouds concerns the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of schaal created on 10/03/2020, modeling between 308 and 320 tokens. The stress value of the MDS solution for the cloud of models is 0.121.

The subclouds of the cloud of models are less distinct than for other lemmas, but color coding does show important groupings, with FOC-POS:all + PPMI:selection | PPMI:no on the right side, PPMI:weight on the top left quadrant and FOC-POS:nav + PPMI:selection | PPMI:no more or less on the bottom left quadrant, closer to the PPMI:weight area than to the FOC-POS:all area (Figure 18). Both SOC-WIN and FOC-WIN seem to group models as well, the former pushing SOC-WIN:10 models to the inside and SOC-WIN:4 to the outside (Figure 19), the latter splitting each of the three main areas internally. The stress values of the MDS solutions of these models range between 0.192 and 0.263.

Figure 18. Cloud of models of 'schaal' colored by `PPMI`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=schaal'> here</a>.


Figure 19. Cloud of models of 'schaal' colored by `SOC-WIN`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=schaal'> here</a>.


In order to compare the effects of the strongest parameters, I selected models with SOC-POS:nav + LENGTH:FOC + PPMI:weight | PPMI:no. I compared sets with different SOC-WIN separately and, while there are some differences, I cannot yet decide which setting works better.

These models exclude relatively few tokens: 4 with FOC-POS:nav + FOC-WIN:5, 6 with FOC-POS:nav + PPMI:weight and 12 with all three restrictions. In the distance matrix between these models (Distance matrix 4), the highest value is found between the strictest and the least strict model (0.72 with SOC-WIN:10, 0.75 with SOC-WIN:4). With SOC-WIN:10, FOC-WIN:5 makes for a bigger difference between models that differ in PPMI (0.65-0.7), but not so much when they differ in FOC-POS. With SOC-WIN:4, instead, the distance matrix singles out the FOC-POS:all + PPMI:no models as the most different from the rest, while also containing one of the pairs with the lowest distance, 0.35.
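These distance matrices compare whole models to each other. As a minimal sketch of one way this can be done (the exact measure used in the study is not specified here; correlating the token-by-token distance matrices of two models is an illustrative assumption, and the random matrices are stand-ins):

```python
import numpy as np

def model_distance(D1: np.ndarray, D2: np.ndarray) -> float:
    """Dissimilarity between two models as 1 minus the Pearson
    correlation of the upper triangles of their token-by-token
    distance matrices (an illustrative choice, not necessarily
    the measure used in the study)."""
    iu = np.triu_indices_from(D1, k=1)
    r = np.corrcoef(D1[iu], D2[iu])[0, 1]
    return 1.0 - float(r)

# Stand-in symmetric distance matrices for two models over 30 tokens
rng = np.random.default_rng(1)
A = rng.random((30, 30)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
B = rng.random((30, 30)); B = (B + B.T) / 2; np.fill_diagonal(B, 0)

print(round(model_distance(A, A), 3))  # identical models -> 0.0
print(round(model_distance(A, B), 3))  # unrelated models -> close to 1.0
```

With a measure like this, a value of 0.35 means the two models rank their token pairs fairly similarly, while 0.75 means they organize the tokens quite differently.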

Distance matrix 4. Distance matrix between some models of ‘schaal’
| mapIndex | foc_foc_pos | foc_ppmi | foc_foc_win | soc_soc_win |
|---|---|---|---|---|
| 1 | nav | weight | 10_10 | 4_4 |
| 2 | nav | weight | 10_10 | 10_10 |
| 3 | nav | no | 10_10 | 4_4 |
| 4 | nav | no | 10_10 | 10_10 |
| 5 | all | weight | 10_10 | 4_4 |
| 6 | all | weight | 10_10 | 10_10 |
| 7 | all | no | 10_10 | 4_4 |
| 8 | all | no | 10_10 | 10_10 |
| 9 | nav | weight | 5_5 | 4_4 |
| 10 | nav | weight | 5_5 | 10_10 |
| 11 | nav | no | 5_5 | 4_4 |
| 12 | nav | no | 5_5 | 10_10 |
| 13 | all | weight | 5_5 | 4_4 |
| 14 | all | weight | 5_5 | 10_10 |
| 15 | all | no | 5_5 | 4_4 |
| 16 | all | no | 5_5 | 10_10 |

The MDS solutions seem to be more dispersed (or less concentrated) with PPMI:weight, showing one big cloud with some satellites and even a smaller cloud close by (or two, for FOC-WIN:10 + PPMI:weight models). The smaller cloud is mostly made of tokens from the second homonym, “dish”.

The t-SNE solutions show three small clusters around a mass that turns from an archipelago with perplexity 5 to a more compact mass with perplexity 20 and a more dispersed one with perplexity 50. I don’t see much difference between the solutions with perplexity 20 and 30. The three small clusters correspond quite well to tokens annotated with schaal_1 ‘scale-range’, schaal_5 ‘dish’ and schaal_6 ‘dish-scale’. The three senses do cluster quite neatly even in the MDS solution. Furthermore, PPMI:weight models show a small cluster of tokens built around a shared collocation.
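The perplexity comparisons throughout this report follow the usual t-SNE pattern: the same token vectors are embedded several times with increasing perplexity, which roughly trades local for global structure. A generic sketch (the random data is a stand-in for the actual token vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((60, 30))            # stand-in for one model's token vectors

solutions = {}
for perplexity in (5, 20, 30, 50):  # the values compared in this report
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=0)
    solutions[perplexity] = tsne.fit_transform(X)  # 2D coordinates per token
```

At perplexity 5 each point attends only to a handful of neighbors (hence the “archipelagos”), while at 50 nearly the whole sample acts as one neighborhood, flattening the clusters into a single mass.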

A main difference between models with different FOC-POS for this lemma is whether they make use of prepositions such as op and in. On the one hand, these prepositions do characterize certain usages; on the other, they are very frequent in the sample (particularly op), so they are not necessarily distinctive. In any case, FOC-POS:nav models do not perform worse than their FOC-POS:all counterparts, relying mostly on adjectives.

Finally, I made another selection with FOC-WIN:10 + FOC-POS:nav + SOC-POS:nav, fixing first LENGTH:FOC and then LENGTH:5000, to compare the effects of all three PPMI values and of SOC-WIN. I only looked at the distance matrix and the t-SNE solutions with perplexity 30 (I didn’t get any further insight from the MDS solutions). One observation is that all models retain the three small clusters and that the main schaal_3 ‘scale-main’ cloud is more elongated with PPMI:weight and seems to have some pockets with PPMI:selection. In the distance matrix, SOC-WIN makes the biggest difference between PPMI:selection models and the smallest between PPMI:no models. With LENGTH:5000, the main cloud is always more dispersed and SOC-WIN makes a bigger difference: from 0.24 between PPMI:no models to 0.5 between PPMI:weight models. The difference between PPMI:weight and PPMI:no models is also bigger in this case; the latter gives much more dispersed points.

“schaal” shows a surprisingly good division between homonyms and senses, regardless of their frequencies, and while PPMI:weight seems better at picking up a collocation that is not so relevant, all clouds seem to perform similarly. For further inspection I would choose FOC-WIN:10 + PPMI:selection | PPMI:weight + LENGTH:FOC + SOC-WIN:4 + SOC-POS:nav. I think this is a particularly fruitful case for understanding the role of prepositions.


blik

The noun blik was tagged with 6 definitions, reproduced in Table 19. Both homonyms are polysemous, with a clearer polysemy in the more frequent one (whose senses are roughly equally frequent) and a more subtle metonymy (conceptually not challenging, but strongly dependent on the clarity of the context) in the other. The former corresponds to English “gaze/look”, with separate senses for physically looking at something (blik_1), the (eye-focused) facial expression (blik_2) and the metaphorical, intellectual look (blik_3). The latter means “tin” and can refer to the material itself (blik_4), an object such as a tin can (blik_5) or canned food (blik_6). In my pilot sample I didn’t find instances of blik_6 and found it hard to distinguish between the other senses of “tin”, particularly given the low frequency of this homonym relative to the other one.

Table 19. Definitions of ‘blik’.
| code | definition | example | freq |
|---|---|---|---|
| blik_1 | 1.1 oogopslag | een blik werpen op iets, een blik van verstandhouding | 10 |
| blik_2 | 1.2 gezichtsvermogen | een scherpe blik | 12 |
| blik_3 | 1.3 inzicht, in intellectuele zin | een brede blik | 11 |
| blik_4 | 2.1 dun geplet metaal, i.h. bijz. vertind dun plaatstaal | dozen uit blik | 6 |
| blik_5 | 2.2 voorwerp (i.h. bijz. doos voor voedsel) vervaardigd uit zulk materiaal | stoffer en blik, een blik erwtjes, een maaltijd uit blik | 0 |
| blik_6 | 2.3 voedsel bewaard in een voorwerp als bedoeld in 2.2 | eet je niet teveel blik? | 0 |

Sense distribution

The sample consists of 280 tokens (7 batches) out of 22175 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 20.

The pilot sample seems to have underestimated the frequency of blik_1 ‘look’, which appears as frequent as blik_2 ‘look-expression’ and blik_3 ‘look-intellectual’ together in the larger annotation. Between blik_4 ‘tin-material’ and blik_5 ‘tin-object’, the annotators seem to prefer the second sense, and only two cases out of the 280 were primarily tagged as blik_6 ‘canned food’. There is also a high number of cases with no agreement between the annotators.

“blik” is a noun with two homonyms of different frequencies, each with three senses and skewed frequencies.

Figure 20. Distribution of majority senses of 'blik' per batch


Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 20 (raw number of tokens with such senses assigned) and Table 21 (mean confidence of such sense annotation in each token).

We would expect no confusion at all between the homonyms, and probably more confusion between “tin” senses than between “look” senses. Indeed, there is only one case in which most annotators chose blik_3 ‘look-intellectual’ while one tagged blik_4 ‘tin-material’; otherwise most of the disagreement pertains to geen ‘none of the above’ tags. As we can see in Table 21, the confidence of that annotation is low (I checked: also for the other annotators of the same token). The example is copied in ((???)) and proves an interesting one to look for in the cloud, given the potential context words in different models.

door één en dezelfde fotograaf . Stuk voor stuk zijn ze gemaakt met eenzelfde sobere blik die tegelijkertijd afstandelijk en inlevend is . Alles op die foto’s is bedoeld om in
by one and the same photographer. One by one they were taken with one and the same sober look that is at the same time detached and empathizing. Everything on those photographs is meant to

There is a high number (5%) of tokens on which no two annotators could agree, but a closer inspection of the annotations reveals that the disagreement tends to be between the three senses of the first homonym (and sometimes between blik_4 ‘tin-material’, blik_5 ‘tin-object’ and unclear or between), so collapsing the senses into homonyms would get rid of most of the disagreement. On the other hand, the disagreement within the “look” homonym is surprisingly high, especially between blik_1 ‘look’ and blik_2 ‘look-expression’ (which are a matter of perspective and might, indeed, be hard to discriminate in certain contexts). This is particularly revealing when looking at the mean confidence values, which are overall quite low. At the homonym level, however, the proportion of tokens with full agreement is very high in both cases.
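The non-weighted and weighted matrices below can be derived from the raw annotations roughly as follows (a sketch with made-up tokens and confidence values; rows are the majority sense per token, columns every tag assigned to it, counting tokens for the non-weighted version and averaging confidences for the weighted one):

```python
from collections import Counter, defaultdict

# Hypothetical annotations: per token, one (sense, confidence) pair per annotator
annotations = {
    "blik/1": [("blik_1", 4), ("blik_1", 5), ("blik_2", 3)],
    "blik/2": [("blik_4", 3), ("blik_5", 4), ("blik_5", 4)],
    "blik/3": [("blik_1", 2), ("blik_2", 3), ("blik_3", 2)],  # no majority
}

counts = defaultdict(Counter)                        # non-weighted matrix
confidences = defaultdict(lambda: defaultdict(list))

for token, anns in annotations.items():
    tag, n = Counter(s for s, _ in anns).most_common(1)[0]
    majority = tag if n >= 2 else "no_agreement"     # row of the matrix
    for sense in {s for s, _ in anns}:               # every tag used on this token
        counts[majority][sense] += 1
        confidences[majority][sense] += [c for s, c in anns if s == sense]

# Weighted matrix: mean confidence per cell
weighted = {row: {s: sum(cs) / len(cs) for s, cs in cells.items()}
            for row, cells in confidences.items()}

print(counts["blik_1"]["blik_2"])    # 1
print(weighted["blik_1"]["blik_1"])  # 4.5
```

The actual pipeline surely differs in details (e.g. how ties and geen tags are handled), but this is the general shape of both tables.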

Table 20. Non-weighted sense matrix of ‘blik’ senses. Proportion of tokens with full agreement per sense tag: blik_3: 0.44, blik_5: 0.68, blik_1: 0.52, blik_4: 0.6, blik_2: 0.22. Proportion of tokens with full agreement per homonym: gaze: 0.98, tin: 0.94. Columns are grouped as gaze (blik_1–blik_3), tin (blik_4–blik_6) and geen (remaining columns).

| senses | blik_1 | blik_2 | blik_3 | blik_4 | blik_5 | blik_6 | between | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|---|---|---|
| blik_1 | 162 | 56 | 21 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| blik_2 | 27 | 37 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| blik_3 | 13 | 5 | 34 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| blik_4 | 0 | 0 | 0 | 5 | 2 | 0 | 0 | 0 | 0 | 0 |
| blik_5 | 0 | 0 | 0 | 5 | 22 | 2 | 0 | 0 | 0 | 0 |
| blik_6 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 |
| not_listed | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| unclear | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| wrong_lemma | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| no_agreement | 12 | 12 | 11 | 2 | 3 | 0 | 0 | 2 | 3 | 0 |
| total | 216 | 111 | 67 | 14 | 28 | 4 | 1 | 4 | 4 | 1 |
Table 21. Weighted sense matrix of ‘blik’ senses

| senses | blik_1 | blik_2 | blik_3 | blik_4 | blik_5 | blik_6 | between | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|---|---|---|
| blik_1 | 4.01 | 3.62 | 2.76 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| blik_2 | 3.93 | 3.18 | 4 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| blik_3 | 3.77 | 2.8 | 3.68 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| blik_4 | 0 | 0 | 0 | 4.13 | 3 | 0 | 0 | 0 | 0 | 0 |
| blik_5 | 0 | 0 | 0 | 3 | 4.2 | 5 | 0 | 0 | 0 | 0 |
| blik_6 | 0 | 0 | 0 | 4 | 5 | 3.5 | 0 | 0 | 0 | 0 |
| not_listed | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 2.5 | 0 | 0 |
| unclear | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.5 | 0 |
| wrong_lemma | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.5 |
| no_agreement | 3.75 | 3.58 | 3.36 | 2.5 | 2.67 | 0 | 0 | 0 | 1.67 | 0 |

Nephology of blik

A first impression on the clouds relates to the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of blik created on 11/03/2020, modeling between 261 and 280 tokens. The stress value of the MDS solution for the cloud of models is 0.148.

The main groups in the cloud of models come from an interaction between FOC-POS and PPMI, with FOC-POS:all + PPMI:selection | PPMI:no on the right side, FOC-POS:all + PPMI:weight in the middle and FOC-POS:nav on the left side, inside which PPMI:no is distinct from PPMI:weight | PPMI:selection (Figure 21). The FOC-POS:all + PPMI:selection | PPMI:no group is split by PPMI along the vertical axis and by SOC-WIN along the horizontal one; some FOC-WIN:10 groups can be found within each PPMI:no group as well as a LENGTH:FOC one within each PPMI:weight group (Figure 22). The stress values of the MDS solutions of these models range between 0.191 and 0.296.

Figure 21. Cloud of models of 'blik' colored by `PPMI`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=blik'> here</a>.


Figure 22. Cloud of models of 'blik' colored by `FOC-WIN` (left) and `LENGTH` (right). Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=blik'> here</a>.


By inspecting the distance matrix with different combinations of parameters, there seems to be an interaction between LENGTH, SOC-WIN and FOC-WIN: with LENGTH:5000, the difference between models varying only in SOC-WIN or FOC-WIN depends on the value of the other parameter, and FOC-WIN settings seem to matter more than SOC-WIN, while with LENGTH:FOC the values seem more independent and equally distinctive.[3] Therefore, to compare stronger models I would set SOC-POS:nav + LENGTH:FOC + PPMI:weight | PPMI:no + SOC-WIN:4, but then check the robustness of the descriptions with SOC-WIN:10 and with LENGTH:5000.

The distance matrix between the selected models (Distance matrix 5) shows relatively high values, the smallest being 0.35 (between FOC-POS:all + PPMI:weight models) and the largest 0.94 (between FOC-WIN:10 + FOC-POS:nav + PPMI:weight on the one hand and the FOC-POS:all + PPMI:no models on the other, which present the largest values overall and a distance of 0.48 between each other).

Distance matrix 5. Distance matrix between some models of ‘blik’
| mapIndex | foc_foc_pos | foc_ppmi | foc_foc_win |
|---|---|---|---|
| 1 | nav | weight | 10_10 |
| 2 | nav | no | 10_10 |
| 3 | all | weight | 10_10 |
| 4 | all | no | 10_10 |
| 5 | nav | weight | 5_5 |
| 6 | nav | no | 5_5 |
| 7 | all | weight | 5_5 |
| 8 | all | no | 5_5 |

The MDS solutions show two outliers to which the PPMI:no models (and especially the least strict model) are quite sensitive, so that all other tokens are pushed together in the center, far away from them. PPMI:selection models are much less sensitive. The concordances of these outliers are reproduced and discussed in the Outliers subsection. They don’t stand out in the t-SNE solutions, although one of them tends toward the periphery.

The t-SNE solutions start presenting some clusters at perplexity 20, but PPMI:no models are then just uniform masses. The clusters are clearer at perplexity 30 but still part of the bigger, dispersed cloud, and only PPMI:weight + FOC-POS:nav + FOC-WIN:5 retains some of that structure at perplexity 50.

The “tin” homonym does tend to group in all solutions, particularly in PPMI:weight models, but even t-SNE solutions don’t set it apart very clearly. The same goes for blik_2 ‘look-expression’ and blik_3 ‘look-intellectual’, which occupy distinct areas in the MDS models but within the bigger “look” cloud; only blik_2 ‘look-expression’ seems to form some sort of cluster in t-SNE solutions with perplexity 20 or 30.

There are however two clear clusters in t-SNE solutions (perplexity 20 or higher) that don’t match sense tags but collocations: one for “een blik werpen” and one for “de blik richten”. They can be seen in PPMI:weight models and also in models with FOC-WIN:5 and/or FOC-POS:nav if perplexity is lower than 50. These clusters, and that for blik_5 ‘tin-object’, are more distinguishable from the main, chaotic cloud in PPMI:weight + FOC-WIN:5 | FOC-POS:nav models (but not so much in PPMI:weight + FOC-WIN:5 + FOC-POS:nav). This also looks much better with SOC-WIN:4 than with SOC-WIN:10, and with LENGTH:FOC than with LENGTH:5000. Besides, with FOC-POS:nav, PPMI:selection models perform very similarly to PPMI:weight ones here.

The models of “blik” are not very successful at discriminating the senses used in the annotation, although to a certain degree they can distinguish the homonyms. However, they do tell apart some fixed constructions, so that t-SNE models cluster cases of “een blik werpen” and “de blik richten”. For further inspection I would select models with PPMI:weight + LENGTH:FOC + SOC-WIN:4 + SOC-POS:nav.

Outliers

The outliers in the MDS models are reproduced in (15) and (16). The first one was unanimously assigned the blik_5 ‘tin-object’ tag, with confidence values of 3 and 4, and in most MDS solutions it is placed next to another token of the same sense. Without any filter, it has 9 first-order features (8 of which are nouns or adjectives), all of them with positive PMI with blik/noun.

  (15) 225 gram vet varkensgehakt ; 175 verse , geschilde waterkastanjes of 85 gram waterkastanjes uit blik ; 1 theelepel zout ; theelepel versgemalen zwarte peper ; 3 eetlepels fijngehakte lente-uien ;
    225g minced pork fat; 175 fresh, peeled water chestnuts or 85g canned water chestnuts (lit. water chestnuts from a can); 1tsp salt; tsp freshly ground black pepper; 3 tbsp finely chopped spring onions;

The second outlier was a real source of confusion: one annotator assigned blik_1 ‘look-main’ with confidence 5, another blik_2 ‘look-expression’ with confidence 4, and another geen ‘none of the above’ with confidence 0 and the following explanation: “ik heb geen flauw idee wat hiermee bedoelt wordt” (‘I don’t have the faintest idea what is meant by this’). Even the MDS models that are not so sensitive to outliers push it to the periphery: with or without filters, the only valid context word is “blanco/adj”, with a PPMI of 2.01.

  (16) , zei Simons , ’ ik ben Johan Simons van Hollandia . Blanco blik . Dat trek ik me dan persoonlijk aan , ja . ’
    , said Simons, ‘I am Johan Simons from Hollandia. Blank stare. Then I take that personally, yes.’
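The PPMI value quoted for “blanco/adj” follows the standard definition: the log-ratio of the observed target–context co-occurrence to what independence would predict, floored at zero. As a reminder, with hypothetical raw counts (the numbers below are made up for illustration, not the actual corpus frequencies):

```python
import math

def ppmi(cooc: int, freq_target: int, freq_context: int, total: int) -> float:
    """Positive PMI: max(0, log2(p(t, c) / (p(t) * p(c)))), computed
    from raw co-occurrence and marginal frequency counts."""
    if cooc == 0:
        return 0.0
    pmi = math.log2((cooc * total) / (freq_target * freq_context))
    return max(0.0, pmi)

# Hypothetical counts for a target noun and a context adjective
print(round(ppmi(8, 20000, 100, 1_000_000), 2))  # 2.0: co-occurs 4x more than chance
print(ppmi(1, 20000, 5000, 1_000_000))           # 0.0: below chance, floored
```

A PPMI of about 2 thus means the pair co-occurs roughly four times more often than the two frequencies alone would predict.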

spoor

The noun spoor was tagged with 8 definitions, reproduced in Table 22. There are three homonyms with uneven distribution. The first and most frequent one corresponds roughly to “trace” and comprises senses such as “physical footprint” (spoor_1), “trace, evidence of (previous) presence” (spoor_2), “traces (of a substance in another)” (spoor_3) or “figurative trace/path to follow” (spoor_4, which I didn’t check for in the pilot annotation). The second one refers to the railways (spoor_5) and its metonymic extensions to trains (spoor_6) and railway companies (spoor_7). The last homonym means “spur” but is almost never used in its literal sense in this corpus, while it does occur in the fixed idiomatic expression “zijn sporen verdienen” (prove one’s skills or aptitude for something).

Table 22. Definitions of ‘spoor’.
| code | definition | example | freq |
|---|---|---|---|
| spoor_1 | 1.1 afdruk door iets of iemand op z’n weg achtergelaten | het spoor van een fiets op een zandweg, een spoor van vernieling | 3 |
| spoor_2 | 1.2 blijk van aanwezigheid door iets of iemand (ongewild) achtergelaten | naar sporen zoeken, iemand op het spoor komen | 17 |
| spoor_3 | 1.3 kleine hoeveelheid | sporen van lood in het leidingwater | 10 |
| spoor_4 | 1.4 te volgen of gevolgde weg in figuurlijke zin | het juiste spoor | 0 |
| spoor_5 | 2.1 weg met twee rijen metalen staven waarover treinen e.d. rijden | niet op het spoor lopen! | 3 |
| spoor_6 | 2.2 de trein als vervoermiddel | met het spoor reizen | 1 |
| spoor_7 | 2.3 spoorwegbedrijf | bij het spoor werken, het spoor staakt | 2 |
| spoor_8 | 3 metalen punt of wieltje aan de hiel van een rijlaars, gebruikt om het rijdier te prikkelen | zijn sporen verdienen | 3 |

Sense distribution

The sample consists of 360 tokens (9 batches) out of 37307 occurrences in the QLVLNewsCorpus; the distribution of the majority senses of each batch, as well as the pilot-based estimate and the overall distribution, are reproduced in Figure 23.

The overall distribution shows some minor differences from the expected one. There are fewer cases of spoor_3 ‘traces-substance’, quite a number of spoor_4 ‘traces-figurative’, and also many tokens on which the annotators didn’t reach an agreement. The whole “railway” homonym seems quite infrequent (still, 10% of the tokens), and while the “spur” homonym was expected to be very infrequent, it is remarkable that it keeps occurring.

“spoor” is a noun with three homonyms of different frequencies, the most frequent of which are polysemous with skewed frequencies.

Figure 23. Distribution of majority senses of 'spoor' per batch


Confusion matrix

The confusion matrix between the majority senses and other tagged senses can be seen in Table 23 (raw number of tokens with such senses assigned) and Table 24 (mean confidence of such sense annotation in each token).

We expect no overlap between homonyms, except maybe between spoor_4 ‘trace-figurative’ and the “railway” senses. The metonymic relations between the senses of this second homonym are also probably harder to determine than those between the senses of the first one, but at the same time its low frequency makes comparison difficult.

There is indeed almost no overlap between homonyms. For spoor_8 ‘spur’, there is only one case of confusion, with a token whose majority sense is spoor_2 ‘trace-evidence’ (and it must be a mistake in the annotation, because it makes no sense to apply spoor_8 to it), and there are very few between the first two homonyms. The cases of spoor_4 ‘trace-figurative’ where some annotator suggested not_listed only imply a less than perfect understanding of the definitions on the part of the annotator, since according to their comments they were identifying the right (figurative) meaning. Finally, 9.44% of the cases show no agreement between annotators, although those annotations tend to carry low confidence. Most of them are disagreements between senses of the first homonym, but there are also some incomprehensible mistakes.

Table 23. Non-weighted sense matrix of ‘spoor’ senses. Proportion of tokens with full agreement per sense tag: spoor_4: 0.54, spoor_2: 0.44, spoor_1: 0.29, spoor_7: 0.5, spoor_5: 0.71, spoor_8: 1, spoor_3: 0.41, spoor_6: 0.5. Proportion of tokens with full agreement per homonym: traces: 0.88, railway: 0.93, spur: 1. Columns are grouped as traces (spoor_1–spoor_4), railway (spoor_5–spoor_7), spur (spoor_8) and geen (remaining columns).

| senses | spoor_1 | spoor_2 | spoor_3 | spoor_4 | spoor_5 | spoor_6 | spoor_7 | spoor_8 | between | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| spoor_1 | 49 | 27 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| spoor_2 | 41 | 129 | 10 | 14 | 0 | 0 | 0 | 1 | 0 | 2 | 4 | 0 |
| spoor_3 | 5 | 10 | 22 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| spoor_4 | 5 | 14 | 0 | 72 | 2 | 0 | 0 | 0 | 0 | 7 | 5 | 0 |
| spoor_5 | 1 | 0 | 0 | 2 | 31 | 3 | 3 | 0 | 0 | 0 | 0 | 0 |
| spoor_6 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| spoor_7 | 0 | 0 | 0 | 0 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 0 |
| spoor_8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 |
| not_listed | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| unclear | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 |
| no_agreement | 25 | 25 | 6 | 12 | 3 | 3 | 0 | 4 | 2 | 9 | 7 | 1 |
| total | 127 | 205 | 41 | 105 | 40 | 10 | 13 | 15 | 2 | 20 | 18 | 1 |

Overall, the confidence given to the annotations of this lemma is quite low, except for spoor_8 ‘spur’ (the only sense of its homonym) and spoor_5 ‘railway’. This probably has to do with the high number of possible senses within each homonym and the difficulty of distinguishing between them in the context given by the concordance.

Table 24. Weighted sense matrix of ‘spoor’ senses
| senses | spoor_1 | spoor_2 | spoor_3 | spoor_4 | spoor_5 | spoor_6 | spoor_7 | spoor_8 | between | not_listed | unclear | wrong_lemma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| spoor_1 | 3.42 | 3.33 | 4 | 2.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| spoor_2 | 3.59 | 3.77 | 3.5 | 3.43 | 0 | 0 | 0 | 3 | 0 | 2.5 | 2.75 | 0 |
| spoor_3 | 3.4 | 3.4 | 3.89 | 2 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| spoor_4 | 2.2 | 3.21 | 0 | 3.53 | 4.5 | 0 | 0 | 0 | 0 | 3.14 | 1.2 | 0 |
| spoor_5 | 2 | 0 | 0 | 2 | 4.04 | 5 | 3.67 | 0 | 0 | 0 | 0 | 0 |
| spoor_6 | 0 | 0 | 0 | 0 | 5 | 3.33 | 0 | 0 | 0 | 0 | 0 | 0 |
| spoor_7 | 0 | 0 | 0 | 0 | 1 | 2.5 | 3.92 | 0 | 0 | 0 | 0 | 0 |
| spoor_8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.28 | 0 | 0 | 0 | 0 |
| not_listed | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| unclear | 2 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 1.5 | 0 |
| no_agreement | 2.7 | 3.24 | 3 | 2.33 | 3.67 | 3 | 0 | 2.5 | 0 | 2.44 | 1.71 | 5 |

Nephology of spoor

A first impression on the clouds relates to the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 144 models of spoor created on 11/03/2020, modeling between 342 and 360 tokens. The stress value of the MDS solution for the cloud of models is 0.146.

The most divisive parameter is FOC-POS, splitting the cloud of models along the vertical dimension. Within each half, the clearest divisions are made by PPMI (Figure 24), particularly in the FOC-POS:all area, which is more dispersed. In each of those groups, FOC-WIN and SOC-WIN split the models along orthogonal dimensions, especially with FOC-POS:all + PPMI:selection, where SOC-WIN seems stronger, and FOC-POS:all + PPMI:no, but to a lesser degree in the rest of the groups (maybe just because they are more cluttered; see Figure 25). The stress values of the MDS solutions of these models range between 0.236 and 0.279.

Figure 24. Cloud of models of 'spoor' colored by `PPMI`. Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=spoor'> here</a>.


Figure 25. Cloud of models of 'spoor' colored by `FOC-WIN` (left) and `SOC-WIN` (right). Explore it <a href='https://montesmariana.github.io/NephoVis/level1.html?type=spoor'> here</a>.


In order to compare the strongest parameters, while also accounting for the interaction between PPMI, FOC-WIN and SOC-WIN, I’ll first fix LENGTH:FOC + SOC-POS:nav and then alternatively select different values of PPMI. At the end I also (briefly) compared LENGTH:FOC with LENGTH:5000; the difference was quite small and not necessarily an improvement.

With any PPMI value, the parameter that seems to make the least difference is SOC-WIN, but its effect does depend on the values of FOC-POS and FOC-WIN. With FOC-POS:nav the differences range between 0.11 and 0.20, against a range of 0.20 to 0.68 with FOC-POS:all; the values increase slightly with FOC-WIN:10 if PPMI:selection and decrease otherwise. All in all, the FOC-POS:all + SOC-WIN:4 models have the highest distance values, but only with PPMI:no + FOC-WIN:all + FOC-POS:nav, in the t-SNE solution with perplexity 30, did I find a setting of SOC-WIN that makes a significant improvement. Therefore, we could just keep the default setting. Distance matrix 6 shows the distance matrix between the PPMI:weight + LENGTH:FOC + SOC-POS:nav models.

Distance matrix 6. Distance matrix between some models of ‘spoor’
| mapIndex | foc_foc_pos | foc_foc_win | soc_soc_win |
|---|---|---|---|
| 1 | nav | 10_10 | 4_4 |
| 2 | nav | 10_10 | 10_10 |
| 3 | all | 10_10 | 4_4 |
| 4 | all | 10_10 | 10_10 |
| 5 | nav | 5_5 | 4_4 |
| 6 | nav | 5_5 | 10_10 |
| 7 | all | 5_5 | 4_4 |
| 8 | all | 5_5 | 10_10 |

Given PPMI:weight, the models don’t look very different from each other in the MDS solutions; FOC-POS:all models with other PPMI values are more sensitive to one or two outliers. Only the t-SNE solutions (from perplexity 20 onwards) show more distinct clusters with FOC-POS:all. In any case, at most two small dense clusters can be distinguished from the greater, dispersed mass without color coding.

Color coding lets us see that the “spur” and “railway” homonyms do tend to stick together in MDS and are more compact in FOC-POS:all models. The difference is more evident with PPMI:selection | PPMI:no. The “spur” homonym has its own cluster in the t-SNE solutions, but it’s so infrequent that the cluster is very tiny; “railway” forms clusters that don’t stray far from the main mass and is better rendered with FOC-WIN:10, improved by FOC-POS:all for PPMI:weight and by FOC-POS:nav otherwise.

There are too many sense tags[4] to expect much clarity in the clouds. In principle, it would seem that spoor_3 ‘traces-substance’ and spoor_4 ‘trace-figurative’ stay apart in MDS solutions, and rather compact with FOC-POS:all, while there is some overlap between spoor_3 ‘traces-substance’ and spoor_2 ‘traces-evidence’, and spoor_1 ‘trace-footprint’ is all over the place.

The t-SNE solutions don’t look totally random, but it’s still hard to find structure in them: those with perplexity 5 are scattered archipelagos and those with perplexity 50 are almost uniform masses; perplexity 30 seems to render the structure a bit better than 20. There are some clusters, and they are more compact and better separated in PPMI:weight models overall (FOC-WIN:10 seems to work particularly well). There is a cluster of spoor_1 ‘trace-footprint’ and spoor_2 ‘traces-evidence’ corresponding to the expression “sporen nalaten”; some for spoor_3 ‘traces-substance’, spoor_4 ‘trace-figurative’, spoor_8 ‘spur’ and spoor_5 ‘railway-main’; as well as a rather populated one with cases of “spoor van de daders”. The t-SNE models have also revealed a number of interesting tokens that keep occurring in the neighborhood of clusters to which they don’t “really” belong. Their concordances are reproduced and discussed in the Visitors subsection.

For further insight I would look into FOC-WIN:10 + PPMI:weight + SOC-WIN:4 + SOC-POS:nav + LENGTH:FOC and try to figure out how FOC-POS actually impacts the clusters, especially in t-SNE models with perplexity 30.

Visitors

The tokens transcribed in (17) through (19) consistently (in PPMI:weight models) occur inside or next to groups to which a human wouldn’t assign them. I find them interesting as examples of the discrepancies between what the model does and what a human interpreter would do.

Example (17) was unanimously assigned the spoor_1 ‘trace-footprint’ tag (with confidences of 2, 5 and 5) but FOC-WIN:10 + PPMI:weight models, which unlike their FOC-WIN:5 counterparts group the “railway” tokens tightly together, place that token in the middle of the “railway” cluster.

  (17) opgetrommeld voor het opruimen van een kilometerslang oliespoor op de Waregemseweg . Het spoor liep zelfs tot aan het station van Anzegem , zodat ook de brandweer van Anzegem
    rounded up for cleaning an oil trail one kilometer long on the Waregemseweg. The trail even ran up to the station of Anzegem, so that the firefighters from Anzegem too

Examples (18) and (19) both tend to orbit the spoor_4 “op het (goede) spoor zetten” cluster, even in FOC-WIN:5 models that don’t hold the cluster together so tightly. The former was assigned the spoor_7 ‘railway-company’ tag with confidences of 4 and 5 and the spoor_6 ‘railway-train’ tag with confidence 2, while the latter was assigned the spoor_5 ‘railway-main’ tag by one annotator, with maximum confidence, and the geen ‘none of the above’ tag with confidence 3 by the other two, both making reference to the fact that it was a fixed expression (one even specified: “op de sporen zetten”). Here the systematicity of the model allowed us to cluster two tokens where the annotators were lost, maybe because of the order in which they annotated: the sense may have been less clear to them than in the other 2 or 3 tokens of that expression in their batch.

  (18) om een staatshervorming , wordt zo misschien toch een stap gezet richting regionalisering van het spoor . " Al moet de federale niet te veel verwachten , in onze
    a state reform, then maybe a step is taken towards the regionalisation of the railway after all." Though the federal [government] should not expect too much, in our
  (19) de wetenschap en de journalisten om een leefbaar alternatief voor de verwoestende agro-industrie op de sporen te zetten . S. De Clercq / Wevelgem Bio is gezonder
    science and the journalists to put a livable alternative to the devastating agro-industry on the [right] tracks. S. De Clercq / Wevelgem Organic is healthier

References

De Pascale, S. (2019). Token-based vector space models as semantic control in lexical lectometry (PhD thesis). Retrieved from https://lirias.kuleuven.be/retrieve/549451


  1. It would seem our instructions were not clear enough. All three annotators agreed on the sense tag for this instance of horde, which is “right”, except that it’s not Dutch.

  2. These 12 tokens all come from Het Nieuwsblad, between 2003 and 2004, and occur in the 9th position of their respective articles.

  3. This comparison was performed on models with SOC-POS:nav + PPMI:weight | PPMI:no. Given LENGTH:FOC, distances between models varying only in FOC-WIN range between 0.27 and 0.59 and are higher with PPMI:no, and for models varying only in SOC-WIN, between 0.24 and 0.42 and are higher with PPMI:weight, except for the one with least restrictions, where the distance is 0.69.

  4. I merged the two geen ‘none of the above’ tags (unclear and not_listed) to avoid having more than 10 values in the categorical variable “majority_sense” for spoor.

Mariana Montes

2020-03-26